Reference genome sampling and sequencing
Two Styela plicata individuals were collected in the harbor of Barcelona in September 2020 (Extended Data Table 1, Fig. 1). One of the individuals was kept alive until DNA extraction to best preserve DNA integrity, while the other was immediately preserved in RNAlater for RNA extractions. Once in the laboratory, two gill folds were excised for DNA extraction and genome size estimation, while 25 mm2 each of gill, mantle, and gonad tissues were selected for RNA extractions.
DNA extraction followed a protocol based on that of Mayjonade et al. (2016) 53. RNA extractions were performed according to the protocol of Ghangal et al. (2009) 54. The quality and concentration of both extractions were assessed using the TapeStation 2200 (Agilent Technologies) and the Qubit Fluorometer with the appropriate Qubit dsDNA/RNA BR Reagents Assay Kit (Thermo Fisher Scientific, Waltham, MA).
A SMRTbell library was constructed following the instructions of the SMRTbell Express Prep kit v2.0 with Low DNA Input Protocol (Pacific Biosciences, Menlo Park, CA). One SMRT cell sequencing run was performed in CLR mode on a Sequel System II with the Sequel II Sequencing Kit 2.0. Additionally, a DNA extract of the same specimen was shipped to Novogene (UK) for Illumina Short Reads (SR) Whole Genome Sequencing (WGS). One genomic library (insert size: 350 bp) was prepared and 150 bp paired-end reads were sequenced on an Illumina NovaSeq 6000 platform (San Diego, CA) targeting 30 Gb output. Omni-C short reads were obtained by building the corresponding libraries following DovetailⓇ Omni-C kit manufacturer's instructions with an insert size of 350 bp. The library was sequenced on the NovaSeq 6000 platform using a 150 paired-end sequencing strategy at Novogene (UK). Finally, the RNA extractions from gill, mantle and gonad tissues were pooled at equal concentrations and sent to Novogene (UK) for Illumina paired-end 150 bp RNA-seq of a cDNA library (insert size: 350 bp) with an expected output of 30 Gb (Extended Data Table 3).
Genome size estimations
Genome size was estimated following a flow cytometry protocol with propidium iodide-stained nuclei described in Chueca et al. 2021 55. Fresh mantle and gill tissues of S. plicata were separately chopped with a razor blade in independent Petri dishes containing 2 ml of ice-cold Galbraith and Otto buffer, respectively 56,57. As internal reference standards we used Acheta domesticus (female, 1C = 2 Gb) and chicken nuclei (Gallus gallus, 1C = 1 Gb) (Biosure). The suspension was filtered through a 42 μm nylon mesh and stained with the intercalating fluorochrome propidium iodide (PI, Thermo Fisher Scientific) and treated with RNase II A (Sigma-Aldrich), each with a final concentration of 25 μg/ml. The mean red PI fluorescence signal of stained nuclei was quantified using a Beckman-Coulter CytoFLEX flow cytometer with a solid-state laser emitting at 488 nm. Fluorescence intensities of 5,000 nuclei per sample were recorded. The software CytExpert 2.3 was used for histogram analyses. The total quantity of DNA in the sample was calculated as the ratio of the mean red fluorescence signal of the 2C peak of the stained nuclei of the S. plicata sample divided by the mean fluorescence signal of the 2C peak of the reference standard. Three replicates for each treatment were measured on 3 different days to minimize instrument error.
Reference genome assembly
PacBio CLR subreads raw sequence files (BioProject = PRJEB67507) were transformed from bam to fastq format using samtools v.1.10 58 and their quality assessed with Nanoplot v.1.28.1 59. Nanofilt v.2.6 59 was used to apply a length threshold filter of 25 kb to remove potential bacterial contamination and mitogenomic sequences shown as secondary GC content peaks in fastQC plots (Extended Data Fig.10). Both Omni-C and WGS-SR data (BioProject = PRJEB67507) were trimmed and filtered using Trimmomatic v.0.39 60 and their quality measured with fastQC v.0.11.9 61.
We used Flye v.2.8 62 to assemble the filtered Pacbio CLR with three iterations of self polishing, followed by three polishing rounds with the WGS-SR data using Pilon v.1.23 63. Finally, the assembly draft was deduplicated using purge_dups.py v.1.2.6 64. For genome scaffolding we used Omni-C data and uploaded the polished assembly to the program Juicer v.1.6 65, defining ‘none’ as the restriction enzyme since Omni-C libraries use an endonuclease with random cleavage sites. The function run-asm-pipeline.sh of the program 3d-dna v.201008 66 was called to obtain the contact map, which was manually curated using the interface of Juicebox v.1.5 67. The final chromosome-level assembly was recovered by running the script juicebox_assembly_converter.py of the same program.
Coverage along the genome assembly and mean coverage were tested by mapping the initial CLR and WGS-SR back to the polished assembly using the programs Minimap2 v.2.24 and BWA-mem v.0.7.17, respectively, and visualized with Qualimap v.2.2.1 and multiQC v.1.8 68–71 implemented in backmap.pl v.0.4 72 https://github.com/schellt/backmap. Contiguity statistics were obtained using QUAST, base-level accuracy (qv) was assessed with mercury v.1.3 73, and genome completeness was corroborated by detecting universal single copy orthologs of metazoa using BUSCO v.5.2.2 74We checked for DNA contaminations by comparing our sequences to those in NCBI using BLAST v.2.12 75. The resulting quality features values were graphically represented using the software BlobTools v.2.0 76. As a result, we generated a high-quality chromosome-level reference genome assembly for subsequent analyses, submitted to ENA (BioProject = PRJEB67507).
Reference genome annotation
We downloaded from repbase 77 previously reported transposable elements (TE) of the model species Ciona intestinalis type A (Ciona robusta), as it was the closest related species to Styela plicata with available TE data. Furthermore, we also downloaded the reference genome of the congener Styela clava 78 in order to aid in our TE annotation. We used RepeatModeler v.2.0 79 in both S. plicata and S. clava to generate de novo predictive TE models for the genus Styela. The resulting models obtained for both species were combined with the TE of C. robusta. The model file including the TEs of all three species was used to soft- and hard-mask the chromosome-level assembly with RepeatMasker v.4.1.2 80. We plotted each TE family abundance and Kimura2 substitution levels profile using the program RepeatLandscape.pl, included in RepeatMasker.
To annotate genes, a protein set of Stolidobranchia was downloaded from UniProt 81. In addition, RNA-seq data from gill, mantle, and gonads (BioProject = PRJEB67507) previously filtered with Trimmomatic v.0.39 was assembled into transcripts with Trinity v.2.11 82 and posteriorly merged and clustered by 99% similarity using CD-HIT v.4.8 83 to reduce redundancy. Both the Stolidobranchia protein set and our RNA-seq data were used to annotate the hardmasked genome using blast and exonerate, both implemented in MAKER v.2.31.10 84. Gene ‘ab-initio’ predictions were conducted by AUGUSTUS v.3.4.0, GeneMark-EP v.4.71 and SNAP v.2013-11-29, as implemented in MAKER 85–87. Three rounds of protein modeling were carried out in order to refine the genome annotation. For the first modeling round, BUSCO genes were used to generate a gene model with SNAP, whereas AUGUSTUS and GeneMark were based on RNAseq. After the first modeling, an annotation draft was obtained with MAKER. For AUGUSTUS and SNAP, second and third model rounds were carried out using as a training set those genes obtained by the most updated annotation draft. The new training models were used to reannotate the genome assembly. Additionally, in the last annotation round, tRNAscan-SE v.2.0 88 and snoscan v.0.9.1 89 were activated in MAKER to annotate tRNA and small non-coding RNA (snoRNA). As the last step, we predicted long non-coding RNA (lncRNA) with FEELnc v.0.0.1 90 and CPC2 91. lncRNA shared between both softwares which were not overlapping with any protein coding gene were selected as reliable lncRNA candidates and ultimately included in the S. plicata annotation. Finally, genes were compared against Pfam databases using eggNOG-mapper v.2 92 using default thresholds in order to assign Gene Ontology terms (GO terms), and the best match was recorded in the final annotation file.
Worldwide genome sequencing and genotyping
To assess the genome variation worldwide, we sequenced 24 additional Styela plicata individuals from different harbors previously studied 23: California, USA (n=4), Santa Catarina, Brazil (n=4), North Carolina, USA (n=4), Ferrol, Spain (n=4), Port Elizabeth, South Africa (n=4), and Misaki, Japan (n=4) (Extended Data Table 1, Fig. 1a). Since the cytochrome oxidase I (COXI) haplotypes of these samples were already known 23, we chose whenever possible a balanced proportion of individuals in every population with the two previously described haplogroups (Extended Data Table 1). DNA extractions were sent to Novogene for WGS-SR library preparation and sequencing aiming for 5 Gb of 150 bp paired end reads for each individual.
Global nuclear genome diversity and population differentiation
Illumina WGS-SR data of the 24 worldwide sampled individuals (Extended Data Table 4) were filtered with Trimmomatic v.0.39, and deposited in ENA (BioProject = PRJEB67519). Filtered reads of these individuals were mapped to the newly generated chromosome-level reference genome using BWA-mem and genotypes were called using the mpileup function in BCFtools v.1.10.2 58. VCFtools v.4.2 was used for filtering variants 93. We retained those biallelic SNPs present in 100% of the individuals with a minimum allele frequency of 10% and a minimum coverage of 5 reads. We finally removed loci with a mean coverage across samples above 25 reads, found as the upper whisker value for loci mean coverage, to avoid duplicates.
Sliding windows of 10,000 bp were used to calculate global nucleotide diversity (π) and global Tajima’s D using VCFtools, with a step of 2,500 bp for the π. Tajima’s D and global nucleotide diversity (π) Manhattan plots were drawn with the R package ‘CMplot’ v.3.1.3 94. We recodified all genotypes using the function --012 from VCFools v.4.2, where 0 represents homozygous genotypes for the reference SNP, 1 means heterozygous genotypes and 2 indicates homozygous genotypes for the alternative SNP. We used sliding windows of 10,000 bp with a 2,500 bp step in R to calculate the mean value of the recodified genotypes along the chromosome for each individual, in order to detect linkage blocks, and identify putative chromosomal inversions. Results from all the individuals were plotted using the R package ‘ggplot2’ v.3.4.1 95 chromosomal inversions were identified as linked blocks with constant genotypes. Thus, individuals with mean genotype values close to 0 indicated homozygotes for the reference SNPs (the same allele as the reference genome), individuals with mean genotype values close to 2 indicated homozygotes for the alternative SNP, and individuals with mean genotype values in-between indicate heterozygotes bearing both possible alleles. We defined as putative inversion’s boundaries the points in which the genotypes across individuals shift from heterozygous to homozygous on all individuals in the sliding window analysis. We tested if Tajima’s D values were significantly different within the inversions in relation to the remaining genome using the Wilcoxon test with the function ‘wilcox.test’ implemented in R, setting the paired parameter as false.
To have an alternative definition for the boundaries of the inversions, we selected individuals from both homokaryotypes according to the genotypes plotted in sliding windows, and we performed an FST analysis comparing the individuals with the two chromosomal arrangements with VCFTools. The FST was calculated using a 10,000 bp sliding window and a step of 2,500 bp. Values were graphically viewed using CMplot as a Manhattan plot. FST boundaries suggesting inversion breakpoints were defined as the window in which the FST values shifted from close-to-zero to close-to-one values.
A Multi-Dimensional Scaling (MDS) was obtained using the ‘cluster’ function of PLINK v.2.0 96, which is based on Identity By State (IBS) pairwise distances. The initial dataset included all the SNPs obtained after genotyping, and different subsets were also used for subsequent analyses: a dataset excluding all the chromosomes where we detected putative inversions, and four additional datasets, one per chromosome with inversions. For each dataset, the first five axes of the MDS were plotted, by correlative pairs, using the ‘ggplot2’ package.
In order to detect signals of regional adaptation, FST values were calculated between the Atlantic and Pacific biogeographic regions (see results). We kept as outliers those FST values falling in the top 1%. Manhattan plots were obtained for the FST values across the genome with the package ‘CMplot’ and outlier values were highlighted.
Global mitochondrial genome diversity
We used the filtered WGS-SR data of all individuals (including the individual of the reference genome and the 24 individuals sampled worldwide) to assemble individual mitogenomes with NOVOplasty v.4.2 97 using a publicly available Styela plicata mitogenome as reference (NC_013565.1), a COI sequence as a seed (HQ916426.1) and setting k=33. Mitogenomes were annotated using MITOS2 98, as implemented in the galaxy portal 99, with default parameters and the mitochondrial ascidian genetic code, followed by manual curation with the browser Geneious Prime (https://www.geneious.com/). The mitochondrial genomes of the ascidian species Styela clava (NC_037072), Botryllus schlosseri (NC_021463), and Ciona robusta (NC_034372) were downloaded for comparison. All protein-coding genes were extracted from the mitogenomes and independently aligned with MAFFT 100. To identify phylogenetic relationships between individual mitogenomes, independent gene trees were obtained by maximum likelihood approaches using IQ-TREE2 v.2.2.0 101, performing 50,000 iterations of ultra-fast bootstrap (-b), without codon position partitioning, and using an evolutionary model GTR+I+G for all of them. The consensus gene tree (supertree) based on the independent gene trees and its node support values were obtained using ASTRAL v.5.5.7 102. A second approach was conducted by concatenating all genes in a single matrix (supermatrix), and we ran IQ-TREE2 using the same parameters as previously mentioned. After phylogenetic reconstruction, different mitogroups were identified and analyses between them were performed. Alignment of the whole mitochondrial sequences was performed using the option LINS-i from MAFFT, in order to identify structural rearrangements. MEGA11 103 was used to calculate the whole mitogenome genetic distance among main clades by pairwise comparison with pairwise deletion of indel regions. To identify cyto-nuclear interactions, FST values between the two main mitogenomic groups were calculated with nuclear SNPs for all chromosomes (see results). In the analysis we included only individuals of the localities with an equal number of representatives of the two mitogenomic groups in order to reduce the population effect. Manhattan plots were obtained for the FST values across the genome with the package ‘CMplot’.
Functional analyses of candidate adaptive genome regions
All chromosomal regions potentially driving Styela plicata’s adaptive success (inversions, regional adaptation and cyto-nuclear interactions) were selected to identify their gene functions. As a first step we used eggNOG-mapper 92 in our annotated genes to obtain their corresponding Gene Ontology Terms (GO terms), using default parameters.
1. Inversions: We aimed for a screening on functional enrichment of genes in the inverted regions. For each inversion we generated two gene lists in which one contained the gene IDs found inside the inversion and the other included the gene IDs found in all chromosomes but excluding the IDs inside the same inversion. Boundaries detected by genotypes and FST values were considered independently for gene list construction. To detect functional enrichment, FatiGO 104 was used to compare the GO terms in the two gene lists (inside and outside each inversion) based on genotype and FST boundaries. Enriched GO terms were treated with the REVIGO website 105. A table with the resulting clusters of biological functions was provided. The absolute number of most abundant biological functions was graphically represented with ‘ggplot2’ package. Finally, the locations of the genes with enriched GO terms were visualized within the FST Manhattan plot.
2. Regional adaptation: We focused on all genes within the region with the highest and most abundant top 1% FST values differentiating Atlantic and Pacific populations, in chromosome 3 (see results). We selected the GO terms associated with the genes included in the region and treated them with REVIGO. A ‘TreeMap’ of the biological functions was plotted with the R package “treemap v.2.4-4” (https://cran.r-project.org/web/packages/treemap/index.html).
3. Cyto-nuclear interaction: We selected all the GO terms associated with the genes included in the region with highest and more abundant top 1% FST values for chromosomes 12 and 14 (see results), and treated them with REVIGO. In this case, we selected not only the biological function, but also the Cellular components of the GO terms to assess if the genes produced in the nucleus were active in the mitochondrions. TreeMaps of both classes were plotted with the “treemap” package. Additionally, we extracted the sequence of the genes falling in the highest FST windows with gffREAD v.0.12.8 106 and identified them using BLAST against the NCBI database.