Genome assembly
We sequenced the first homosporous lycophyte genome of L. clavatum L. using a combination of Illumina and Pacbio high-throughput sequencing strategies. We generated 176.13 Gb (76× in depth) clean reads from Illumina and 235.46 Gb (102× in depth) subreads from Pacbio sequencing platforms (Tables S1 and S2). We initially analyzed the genome by K-mer=17, and the estimated heterozygosity and repetitive sequences in the genome were 0.89% and 72.92%, respectively, indicating an extraordinarily complex genome (Table S3). The genome size of L. clavatum was estimated to be 2.53 GB (Table S3, Fig. S1), slightly smaller than the DNA content of 2.86 pg/C by flow cytometry (Hanson & Leitch, 2002). We totally assembled 7,228 contigs, and after decontamination (see Method, Fig. 1A), we got an accurate L. clavatum genome sequence with 7,102 contigs, 2.30 Gb altogether, and contig N50 length is about 1.01 Mb (Fig. 2A, Table 1).
Different strategies were used to assess the quality and completeness of the L. clavatum genome. First, 84.60% of Illumina paired-end reads were successfully mapped to the assembly genome, and all mapped reads accounted for 93.79% (Table S4). Additionally, core eukaryotic gene mapping approach (CEGMA) analysis indicated that 73.39% (338) of core eukaryotic genes were detected in the L. clavatum genome (Table S5). Then, Benchmarking Universal Single-Copy (BUSCO) analysis showed that 75.46% of the 1,614 core embryophyte genes were present in the L. clavatum genome, and 69.33% of them were single-copy genes (Table S6). We also used the LTR Assembly Index (LAI) to evaluate the assembly quality (Ou et al. 2018), and the LAI score of L. clavatum genome is 20.27 (Table S7).
Genome annotation
We used multiple de novo and homology-based prediction procedures to annotate repetitive elements. An extraordinarily high proportion (85.32%, 1.97 Gb) in the L. clavatum genome was identified as repetitive elements, including Class I (retrotransposons, 76.42%), Class II (DNA transposons, 7.71%), potential host genes (0.37%), simple sequence repeats (0.29%), and unclassified elements (5.32%) (Table S8). Copia and Gypsy superfamily of LTR accounted for 36.25% and 26.09%, respectively.
The gene model prediction combined with homologous strategy, ab initio strategy, and transcript evidence-supported strategy. A total of 23,292 genes (distributed in 3,775 contigs) were predicted, the average gene and exon lengths were 13,019.30 bp and 420.33 bp, respectively (Fig. 2b, Table 1 and S9). (Non-redundant protein sequences, NR) homologous species distribution exhibited that the L. clavatum genome was the closest to that of S. moellendorffii (6,684, 32.81%), followed by that of Physcomitrella patens (4,386, 21.53%) (Table S10, Fig. S2). Gene function enrichments of GO mostly focused on ATP binding, membrane, and nucleus (Table S11). Gene function enrichments of COG and KOG mostly focused on replication, recombination, repair, and posttranslational modification, protein turnover, and chaperones (Tables S12 and S13, Fig.s S3 and S4,). Furthermore, 4,567 tRNAs, 530 rRNAs, 24 miRNAs, and 5,771 pseudogenes were predicted in the L. clavatum genome (Table S14).
Discovery and removal of non-plant sequences
The total GC content of the L. clavatum genome is 38.42% (Table 1), while some contigs exhibit abnormally high GC content. Interestingly, an obvious breakpoint was found between 0.52 and 0.57 (Fig. 1b, Table S15). Forty-nine contigs with GC content higher than 0.52 were compared with the Nucleotide Sequence Database (NT) in National Center for Biotechnology Information (NCBI), and we found that these contigs only hit non-plant species, such as fungi, bacteria, and other distant species (-max_target_seqs 1, -evalue 1e-5) (Table S16). Furthermore, in order to verify the reliability of high GC content as non-plant sequences, we calculated the GC content of alignment regions (non-plant sequences) and non-alignment regions (plant sequences) in the contigs with abnormal GC content, and there was no difference for the GC content of the alignment region and the non-alignment region in the same contig (Table S17). Therefore, non-plant sequences and L. clavatum genome sequences could not be separated just by the abnormal GC content.
We assumed that non-plant sequences and the L. clavatum genome sequences were unable to be assembled into one contig, and horizontal gene transfer could be assembled into the L. clavatum genome sequences. We further compared all 7,228 contigs with the NCBI NT database and found that 131 contigs (including 49 contigs with abnormal GC content) were the best alignment for non-plant species which were regarded as the potential non-plant contigs. Then, we compared 601 genes on the 131 contigs with the NT database and found one contig with horizontal transfer genes and five false positive contigs which only mapped to plant gene sequences. Eventually, we got 126 contigs as non-plant sequences with a total length of 6.56 Mb (0.28% of the whole genome) and 585 genes (in 99 contigs) (Table S18). Function enrichment of the non-plant genes mainly involving in energy production and transmembrane transport mediated by ATP-binding cassette transporters (Tables S19 and S20), which indicates that the endophytes may provide the carbon sources to the gametophytes of Lycopodiales (Horn et al. 2013; Szövényi et al. 2021).
Comparative analyses of gene family
We investigated the gene families in the early diverged land plants including P. patens, L. clavatum, I. taiwanensis, S. lepidophylla, S. moellendorffii, and S. tamariscina. We clustered 22,610 orthogroups, accounting for 84% of all six genomes (Table S21), 8,534 orthogroups of L. clavatum shared with other plants, and 877 species-specific orthogroups appeared to be unique in L. clavatum (Table S22). The unique genes in L. clavatum were enriched in posttranslational modification, protein turnover, chaperones, secondary metabolites biosynthesis, transport and catabolism, and transcription (Tables S23 and S24).
Comparative genomics exhibited that the common ancestor of lycophytes had 9,350 gene families. For the L. clavatum genome, it showed the least loss and the least gain from the common ancestor (Fig. 2c). The heterosporous lycophyte ancestor included 9,937 gene families, with only a small number of gain (839) and minimal loss (252) during the evolution from the common lycophyte ancestor node to the heterosporous lycophyte node (Fig. 2c).
The processes of LTR-RTs’ accumulation
It has long been recognized that the abundance of LTR-RTs is positively correlated with the genome size, which is also applicable to lycophytes. There are three types of LTR-RTs: intact LTR-RTs (I, with complete Gag-Pol protein sequence), solo-LTR-RTs (S, LTR-RTs paralogs that lack any Gag-Pol homologs in both the upstream and downstream sequences) and truncated LTR-RTs (T, LTR-RTs with Gag-Pol sequences on one side of flanking sequences), and S:I values indicate the deletion rate of LTR-RTs which are particularly prone to removal when the value is higher than 3 (Lyu et al. 2018). We identified all of three types LTR-RTs in the L. clavatum genome and three published heterosporous lycophyte genomes (I. taiwanensis, S. tamariscina, and S. moellendorffii). For homosporous lycophyte, it exhibited a much greater birth (I+S+T) (Lyu et al. 2018) and extremely slower death (S:I<1) model of LTR-RTs, resulting in the largest accumulation of LTR-RTs and expansion of the L. clavatum genome (Fig. 3ab, Table S25). For heterosporous lycophytes, all have a lower birth and faster death (S:I, 2.81 to 5.60) model of LTR-RTs, resulting in a small accumulation of LTR-RTs and relatively small genomes in heterosporous lycophytes (Fig. 3ab, Table S25). We detected a positive correlation between LTR-RTs and the genome sizes in lycophytes (Fig. S5). All the three types of LTR-RTs exhibited a positive correlation coefficient with the genome sizes (R2 = 0.98, 0.68,0.67; F-test, P-value = 1e-3, 3.08e-5, 9.03e-4), while S:I ratios showed a negative correlation with the genome sizes (R2 = 0.29; F-test, P-value = 1.43e-8, Fig. S5).
In addition, we examined the insertion time of intact LTR-RTs. We found that the insertion time of most LTR-RTs in lycophytes is very young (< 1 Mya) (Fig. 3c). For homosporous lycophyte L. clavatum, the concentrated outbreak of LTR-RTs was around 0.11 Mya, and the insertion of 80.68% LTR-RTs was in one Mya; For heterosporous lycophyte, I. taiwanensis, it had a continuous LTR-RTs insertion in 0.16-1 Mya, and the number of LTR-RTs during this period accounted for 70.76%; For another heterosporous Selaginella, LTR-RTs showed the trend of multiple insertions in different periods during six Mya. The highest peak of insertion time was 0.20 Mya in S. tamariscina, which was 0.25 Mya later than that of in S. moellendorffii (Fig. 3c). Furthermore, we found a sharp contrast of LTR-RTs birth and death model (Fig. 3d). For homosporous lycophyte, it experienced a much greater birth and extremely slower death of LTR-RTs, while the heterosporous lycophytes possessed a lower birth and faster death model (Fig. 3d).
Two specific whole-genome duplications were detected in the Lycopodium clavatum genome
Using a joint strategy of synonymous substitutions per synonymous site (Ks) and 8,498 phylogenetic trees, we identified at least two ancient whole-genome duplications (WGDs) in the L. clavatum genome. First, we tried the syntenic analysis which is the most direct and powerful gold standard for identifying WGDs (Jiao et al. 2914; Wang et al. 2018), but it is difficult to find the collinearity of the L. clavatum genome (Fig. S6). Actually, we only identified 100 WGD gene pairs (197 genes) in 18 syntenic genomic segments and 16,256 dispersed duplicated gene pairs (18,778 genes) using DupGen_finder (Tables S26 and S27) (Qiao et al. 2019). A large number of dispersed duplicated genes and a small number of WGD genes suggest that there may be polyploidy in L. clavatum genome. Based on the analysis of 8,498 low-copy homologous gene trees by tree2gd (https://sourceforge.net/projects/tree2gd/), we found that the proportion of duplicated genes reached 35.69%, which indicated that there were WGDs in L. clavatum (Fig. 2c, Table S28).
In addition, we made Ks statistics for the two best-matched gene pairs produced by Blastp, removed the gene pairs on the same contig (eliminate the influence of tandem repeat genes as far as possible), and finally obtained 19,583 gene pairs, a total of 17,971 genes, distributed on 3,570 contigs. Two obvious peaks were detected (Table S29, Fig. S7), and Ks values of the two peaks (alpha and beta) were 0.45 (σ = 0.14) and 1.27 (σ = 0.67), respectively (Fig. 4a, Table S30), revealing that there were two WGDs in L. clavatum genome (Fig. 2c).
In order to infer the relative position of the two WGDs, we corrected the evolutionary rate with P. patens as the outgroup. The Ks peaks of differentiation between P. patens and L. clavatum, P. patens and I. taiwanensis, and L. clavatum and I. taiwanensis were 1.66 (σ = 0.82), 1.59 (σ = 0.81) and 1.54 (σ = 0.72) respectively (Fig. 4a, Table S30). The evolution rate of L. clavatum was 4.15% which was faster than that of I. taiwanensis. After correction, the alpha and beta Ks peaks of L. clavatum were 0.44 (σ=0.14) and 1.22 (σ=0.65), respectively, and the Ks peak of differentiation between L. clavatum and I. taiwanensis was 1.51 (σ= 0.70) (Fig. 4b, Table S30). Therefore, the two specific WGDs occurred after the differentiation between L. clavatum and I. taiwanensis and affected the formation of L. clavatum genome.
Pathway evolution of Huperzine A biosynthesis in land plants
It is known that many species of Lycopodiaceae yield HupA. To detect the integrality of the function module of HupA biosynthesis in L. clavatum (Fig. 5a), we analyzed the key enzyme genes known to involve in HupA biosynthesis in the genomes of L. clavatum and other six species (P. patens, I. taiwanensis, S. moellendorffii, S. tamariscina, S. lepidophylla, and Arabidopsis thaliana). Our result showed that only L. clavatum had all five types of enzymes (Fig. 5b, Table S31). Although all of the investigated species had the PKS, BBE, and SLS, both P. patens and A. thaliana were short of LDC which restricts the first step of the biosynthetic route, suggesting the acquisition of LDC occurred in the ancestor of lycophytes; I. taiwanensis and other three Selaginella species lacked CAO which restricts the second biosynthetic step, indicating CAO has been lost in the ancestor of heterosporous lycophytes. These results indicated that the HupA biosynthetic pathway in homosporous lycophyte was most complete compared with moss and the other vascular plants including the heterosporous lycophytes. The key innovation of the pathway evolution in homosporous lycophytes might be crucial for HupA biosynthesis.