DNA extraction and DNA library construction
The C. tenuissimus strain NIES-3715 was isolated from the Seto Inland Sea, Japan (Supplementary Fig. 6), in August 2002. To check for bacterial contamination, the cultures were observed using epifluorescence microscopy after staining with SYBR-Gold. Briefly, the lysate was fixed with glutaraldehyde at a final concentration of 1%, and SYBR-Gold (Thermo Fisher Scientific, Waltham, MA, USA) was added to each fixed sample at a final dilution of 1.0 × 10⁻⁴ of the commercial stock. The stained samples were filtered onto 0.2-µm polycarbonate membrane filters (Nuclepore membrane; Cytiva, Sheffield, UK), after which each filter was mounted on a glass slide with a drop of low-fluorescence immersion oil and covered with another drop of immersion oil and a cover slip. The slides were viewed at 1000× magnification with an Olympus BX50 epifluorescence microscope. The axenic algal cultures were grown in a modified SWM3 medium enriched with 2 nM Na₂SeO₃ (ref. 51) under a 12/12-h light-dark cycle at 20°C. The light irradiance was 850 µmol m⁻² s⁻¹, provided by white LED illumination. The algal strain was cultured for 7 days. Approximately 3 × 10⁶–5 × 10⁶ cells/ml in the stationary phase were used for DNA extraction. The cells in the cultures were harvested by centrifugation at 860 × g and 4°C for 15 min, after which the cell pellets were stored at −80°C until analysis. DNA was extracted from the samples using a DNeasy Plant Mini Kit (Qiagen, Valencia, CA, USA), according to the manufacturer’s instructions. DNA libraries for paired-end and mate-pair sequencing were constructed using the KAPA Hyper Prep Kit (F. Hoffmann-La Roche Ltd., Basel, Switzerland) and the Nextera Mate Pair Sample Prep Kit (Illumina, Inc., San Diego, CA, USA), respectively. These libraries were sequenced as 300-bp paired-end reads using MiSeq (Illumina, Inc.) at the Japan Agency for Marine-Earth Science and Technology, Yokosuka, Japan.
For long-read sequencing of genomic DNA using MinION (Oxford Nanopore Technologies, Oxford, UK), total nucleic acid was extracted from the pellet using the DNAs-ici!-F (RIZO Inc., Tsukuba, Japan), according to the manufacturer’s protocol. To extract genomic DNA from the total nucleic acid sample, the sample was treated with RNase A (Nippon Gene, Tokyo, Japan) and subsequently purified with phenol/chloroform prior to construction of a DNA sequencing library. The sequencing library was constructed using the ligation sequencing kit (SQK-LSK109, Oxford Nanopore Technologies) and sequenced by MinION, according to the respective manufacturer’s instructions. After sequencing, base-calling was performed with Albacore (v2.3.1, Oxford Nanopore Technologies).
De novo genome assembly
To estimate the genome size and heterozygosity, k-mer counting was performed using the short paired-end reads and the Jellyfish programme⁵². The histogram of 21-mer counts was visualised using GenomeScope⁵³.
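The k-mer counting step can be illustrated with a minimal sketch. It is not the Jellyfish implementation: it counts k-mers in memory and then tabulates how many distinct k-mers occur at each multiplicity, which is the kind of histogram that GenomeScope takes as input. Treating a k-mer and its reverse complement as the same (canonical) k-mer is an assumption made here, and the function names are illustrative.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmer_histogram(reads, k=21):
    """Map each multiplicity to the number of distinct canonical k-mers seen that often."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Skip k-mers containing ambiguous bases such as N.
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    return Counter(counts.values())
```

For real read sets, a dedicated counter such as Jellyfish is needed; this sketch only shows what the histogram represents.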
Hybrid assembly of all Illumina short reads and MinION reads was performed using MaSuRCA⁵⁴ (v3.3.0) with default parameters. The haploid genome sequence was constructed from the assembled genome using HaploMerger2⁵⁵ (v20180603) with default parameters. The assembly quality was evaluated using QUAST⁵⁶. To evaluate the assembly accuracy, single-copy orthologous genes were searched for using BUSCO³² with the alveolata_stramenophiles_ensembl dataset.
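One of the contiguity statistics reported by QUAST is N50, the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly. The following is a minimal sketch of that definition only; the function name and the example lengths are illustrative and are not taken from this assembly.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig length is required")
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        # Stop at the first contig that brings the running sum to half of the total.
        if running * 2 >= total:
            return length


# Hypothetical contig lengths, for illustration only.
example_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
example_n50 = n50(example_lengths)
```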
Gene prediction and annotation
To predict the gene regions in the genome sequence, we first obtained the complete open reading frames (ORFs) from RNAseq data (DNA Data Bank of Japan (DDBJ) Sequence Read Archive, accession number DRA011082). The de novo assembly of the RNAseq reads followed the procedure described by Hongo et al.⁵⁷. The complete ORFs were extracted from the assembled sequences and translated using TransDecoder⁵⁸ with default parameters. Next, the quality-controlled paired-end RNAseq reads were mapped to the assembled genome using TopHat2⁵⁹ with default parameters. Using the mapping data and the protein sequences of the complete ORFs, the gene model was predicted using BRAKER2⁶⁰ (v2.1.0) with default parameters. Proteins predicted from the gene model were annotated based on their homology to sequences in the NCBI nr database using the BLASTP programme with a threshold e-value of < 1e-5, and protein domains were identified using InterProScan with a threshold e-value of < 1e-5.
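The e-value threshold used for annotation can be sketched as a filter over BLASTP results. The sketch assumes the standard 12-column tabular output (query, subject, identity, alignment length, mismatches, gap openings, query start and end, subject start and end, e-value, bit score); the output format actually used is not stated here. Keeping the hit with the highest bit score per query is also an assumption, and the identifiers in the example are hypothetical.

```python
def best_hits(lines, max_evalue=1e-5):
    """Return {query: (subject, evalue, bitscore)} for hits with evalue < max_evalue."""
    best = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue >= max_evalue:
            continue  # fails the e-value threshold
        # Keep the highest-scoring hit seen so far for this query.
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


# Hypothetical rows, for illustration only.
example_rows = [
    "g1\tsA\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-30\t200",
    "g1\tsB\t50.0\t100\t50\t0\t1\t100\t1\t100\t1e-6\t60",
    "g2\tsC\t30.0\t50\t35\t0\t1\t50\t1\t50\t0.01\t30",
]
example_hits = best_hits(example_rows)
```

In this example, g1 keeps its stronger hit and g2 is left unannotated because its only hit fails the threshold.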
Confirmation of an EVLF in the nuclear genome
To analyse the EVLF in the C. tenuissimus genome, we used nine strains of this diatom species in addition to strain NIES-3715 (Supplementary Fig. 6). One millilitre of a stationary-growth-phase C. tenuissimus culture was centrifuged at 17,400 × g for 3 min at 4°C. The resulting diatom cell pellets were then preserved at −80°C until analysis. DNA was extracted from the stored cell pellets using the DNeasy Plant Mini Kit (Qiagen), according to the manufacturer’s instructions. The EVLF was amplified using the primer pair ctEVLFout_v1_F (5′-GCAAACACGTKTGTTGATATATCGG-3′) and ctEVLFout_v1_R (5′-CGATCCTCTTGAAGACCCAGT-3′) (Fig. 1). PCR amplification was conducted in a reaction mixture with a 20 µl final volume, containing 0.5 µl DNA, 1 × BlendTaq buffer (Toyobo, Japan), 200 nM dNTPs, 0.2 µM of each primer, and 1 U BlendTaq DNA polymerase. PCR was conducted using a GeneAmp PCR System 9700 with the following cycle parameters: 30 cycles of denaturation at 94°C for 30 s, annealing at 55°C for 30 s, and extension at 72°C for 30 s. The PCR products were then electrophoresed on 1% (w/v) agarose ME gels (Wako Pure Chemical Industries, Osaka, Japan), and the nucleic acids were visualised using Midori Green nucleic acid stain (Nippon Genetics, Tokyo, Japan). PCR amplicons of approximately 1.5 kb were excised, and their nucleic acids were extracted (NucleoSpin® Gel and PCR Clean-up; Macherey-Nagel GmbH and Co., KG, Düren, Germany). The PCR products were ligated into the pGEM-T Easy vector (Promega, Madison, WI, USA) and transformed into Escherichia coli DH5α-competent cells (Toyobo, Japan). Sequencing was conducted using the dideoxy method with an ABI PRISM 3130 Genetic Analyzer (Thermo Fisher Scientific).
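The forward primer ctEVLFout_v1_F contains the IUPAC degenerate base K (G or T), so it represents a mixture of two oligonucleotides. A minimal sketch of how such a primer expands into its explicit sequences is given below; the function name is illustrative and the sketch is not part of the original workflow.

```python
from itertools import product

# IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand_degenerate(primer):
    """List every explicit sequence encoded by a degenerate primer."""
    choices = [IUPAC[base] for base in primer.upper()]
    return ["".join(combination) for combination in product(*choices)]


forward_variants = expand_degenerate("GCAAACACGTKTGTTGATATATCGG")
```

Here `forward_variants` holds the two 25-nt sequences that differ only at the K position.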
RT-PCR analysis
The axenic algal cultures of C. tenuissimus strain NIES-3715 were grown in a modified SWM3 medium under a 12/12-h light-dark cycle of ca. 500 to 600 µmol of photons m⁻² s⁻¹, using cool white fluorescent illumination, at 25°C for 3 days. For the RT-PCR analysis, preconditioned cultures were inoculated into 1 l of fresh SWM3 medium at a final density of 2.5 × 10³ cells/ml using a 2-l polycarbonate Erlenmeyer flask (431255; Corning Inc, Glendale, AZ, USA). This experiment was performed in triplicate. The cultures were subsampled at the early logarithmic growth phase (day 1 and day 2) and at the late logarithmic growth phase (day 4 and day 7). On each sampling day, diatom cells in the cultures were retained on 0.4-µm polycarbonate membrane filters (Nuclepore membrane; Cytiva). Each filter retained 10⁷ to 10⁸ diatom cells; the filters were frozen in liquid nitrogen and stored at −80°C until analysis.
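The inoculation to a defined starting density follows from simple dilution arithmetic: the volume of preconditioned culture required equals the target density multiplied by the final volume, divided by the density of the preconditioned culture. In the sketch below, the preculture density of 1 × 10⁶ cells/ml is a hypothetical value chosen for illustration; only the target density and the final volume come from the text.

```python
def inoculum_volume_ml(preculture_density, target_density, final_volume_ml):
    """Volume of preculture (ml) that gives target_density in final_volume_ml."""
    if preculture_density <= 0:
        raise ValueError("preculture density must be positive")
    return target_density * final_volume_ml / preculture_density


# Target of 2.5e3 cells/ml in 1 l, from a hypothetical preculture at 1e6 cells/ml.
example_volume_ml = inoculum_volume_ml(1e6, 2.5e3, 1000)
```

With these assumed numbers, 2.5 ml of preculture would be needed.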
The filters containing the diatom cell samples were cut into small pieces in TRIzol reagent (Thermo Fisher Scientific), and total RNA was extracted using a TRIzol Plus RNA Purification Kit (Thermo Fisher Scientific), with PureLink DNase (Thermo Fisher Scientific) used to digest contaminating DNA, in accordance with the manufacturer’s instructions. To digest any remaining contaminating DNA completely, a further DNase treatment was performed using the TURBO DNA-free Kit (Thermo Fisher Scientific). The quantity of total RNA was measured using a Qubit RNA HS Assay Kit (Thermo Fisher Scientific). cDNA was synthesised from 1 µg of total RNA using an oligo(dT)15 primer and SuperScript IV Reverse Transcriptase (Thermo Fisher Scientific), in accordance with the manufacturer’s instructions.
The EVLF sequence was amplified from the constructed cDNA with Ex Taq Hot Start Version (Takara, Shiga, Japan) under the following conditions: an initial denaturation at 98°C for 1 min, followed by 30 cycles of 98°C for 10 s, 60°C for 30 s, and 72°C for 40 s. Actin was also amplified from the cDNA and from the total RNA, serving as a positive and a negative control, respectively. The primers used for this analysis were as follows: ctEVLFin_v1_F, 5′-AAGAAGAAGAGTCGACTGGATCAAC-3′; ctEVLFin_v1_R, 5′-ACAATAACGGTCTCATGATTGAGC-3′; ctActin_F, 5′-CTGGATGTGTTCTTGATTCTGGAG-3′; ctActin_R, 5′-CTTAGACATACGCTCACTGATTCCTG-3′. The amplicon lengths obtained with these primer pairs were 456 bp for the EVLF and 500 bp for actin.
Phylogenetic analysis of EVLFs and virus replication-associated genes
The genome sequence of the EVLF was obtained for nine strains of C. tenuissimus (Supplementary Fig. 6) using the sequence confirmation procedure described above. To clarify the evolutionary relationship of the EVLFs, a maximum-likelihood (ML) tree analysis was conducted. First, for comparison among the strains, the EVLF protein sequences, including all alleles, were aligned with a replication-associated protein from C. tenuissimus DNA virus SS12-43V (accession no. BBE21064.1) using MAFFT⁶¹ (v7.212) with default parameters. All stop codons and frame-shifted amino acids in the alignment were removed by manual editing. The best-fit evolutionary model for the alignment was selected using ModelFinder⁶² and the Akaike information criterion. The ML tree was inferred under the selected evolutionary model using RAxML⁶³ (v8.2.4) with 100 bootstrap replicates. Second, for comparison with virus replication-associated proteins, virus protein sequences were retrieved from the NCBI database based on their similarity to the EVLF using the BLASTP programme. Their accession numbers are shown in Supplementary Table 4. All retrieved protein sequences and the EVLF were aligned using MAFFT⁶¹ with default parameters, and gaps were automatically trimmed using trimAl⁶⁴ with the ‘-automated1’ command option and default settings for all the other options. The subsequent procedure was the same as that described above.
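Model selection by the Akaike information criterion (AIC) ranks candidate models by AIC = 2k − 2 ln L, where k is the number of free parameters and ln L is the maximised log-likelihood; the model with the lowest AIC is preferred. The sketch below shows only this comparison and does not reproduce ModelFinder. The model names, log-likelihoods, and parameter counts are hypothetical.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_likelihood


def best_model(candidates):
    """Return the name of the candidate with the lowest AIC.

    candidates maps a model name to (log_likelihood, n_params).
    """
    return min(candidates, key=lambda name: aic(*candidates[name]))


# Hypothetical fits, for illustration only.
example_candidates = {
    "model_A": (-1000.0, 10),  # AIC = 2020
    "model_B": (-990.0, 25),   # AIC = 2030
}
example_choice = best_model(example_candidates)
```

In this example, model_B has the higher likelihood but is penalised for its additional parameters, so model_A is selected.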
Identification of orthologous genes
Protein sequences from T. pseudonana, P. tricornutum, and Cyanidioschyzon merolae were retrieved from public databases (accession nos. GCF_000149405.2, GCF_000150955.2, and GCF_000091205.1, respectively). Orthologous gene groups among all the protein sequences, including those of C. tenuissimus, were identified using OrthoFinder⁶⁵ with default parameters. Protein domains in the sequences of the reference organisms were identified using InterProScan with a threshold e-value of < 1e-5.
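Orthogroup tables of this kind can be summarised with a short script. The sketch assumes a tab-separated table with one row per orthogroup and one column per species, where a cell lists that species' genes and is empty when the species has none, as in the Orthogroups.tsv file written by OrthoFinder. The species and gene names in the example are hypothetical.

```python
import csv
import io


def shared_orthogroups(tsv_text):
    """Return IDs of orthogroups that contain at least one gene from every species."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    n_species = len(header) - 1
    shared = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        cells = row[1:]
        # Pad short rows so that missing trailing columns count as empty.
        cells += [""] * (n_species - len(cells))
        if all(cell.strip() for cell in cells):
            shared.append(row[0])
    return shared


# Hypothetical table, for illustration only.
example_table = (
    "Orthogroup\tspecies_1\tspecies_2\n"
    "OG0000000\tg1, g2\th1\n"
    "OG0000001\tg3\t\n"
)
example_shared = shared_orthogroups(example_table)
```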
Prediction of transposable elements
Transposable elements (TEs) in the C. tenuissimus genome were predicted using the RepeatModeler2⁶⁶ and RepeatMasker⁶⁷ programmes with default parameters. To compare the TEs statistically among diatoms, the genomes of T. pseudonana, P. tricornutum, and F. solaris were analysed using the same programmes and parameters, and the results were compared with the TEs already reported for these three genomes⁶⁸,⁶⁹.