Genome sequencing and assembly
Genomic DNA was extracted from the muscle of four hadal snailfish collected from the Mariana Trench, and four Tanaka's snailfish collected from the southern Yellow Sea. We generated a total of 47.8 gigabases (gb) of Nanopore reads, 148.6 gb of BGI short reads, and 123.3 gb of Hi-C reads for hadal snailfish; and 39.0 gb of Nanopore reads, 130.3 gb of BGI short reads, and 99.5 gb of Hi-C reads for Tanaka’s snailfish.
The genome sizes of hadal snailfish and Tanaka’s snailfish were estimated by k-mer distribution analysis (K = 27) of SOApec v255 to be 633.2 Mb and 539.9 Mb, respectively. We then assembled the hadal snailfish and Tanaka’s snailfish genomes based on the filtered Nanopore sequencing data using wtdbg2 v2.4.156 based on default parameters, followed by two rounds of error correction using NextPolish v1.057 based on the filtered BGI sequencing data, and finally assembled them into chromosomal versions using 3D-DNA v18011458 based on Hi-C data. Finally, BUSCO v 4.1.259 was used with the library ‘actinopterygii_odb10’ to analyze and evaluate the completeness of the gene set in our draft genome.
Transcriptome sequencing
A total of 11 transcriptomes from six tissues (eye, stomach, heart, liver, muscle, skin) were extracted from three hadal snailfish, while a total of 26 transcriptomes from 10 tissues (brain, spinal cord, eye, bone, cholecyst, stomach, heart, liver, muscle, skin) were extracted from three Tanaka's snailfish (Supplementary Data 1). RNA was subsequently extracted using TRIzol (Invitrogen) and purified using the RNeasy Mini Kit (Qiagen). Transcriptome reads were obtained from the Illumina HiSeq 2000 sequencing platform. The RNA sequences were filtered by Fastp v0.2060 and assembled without reference using SPAdes61 with default parameters. Subsequently, TransDecoder v5.5.0 (https://github.com/TransDecoder/TransDecoder) was used to identify coding regions of the transcripts.
Genome annotation
Both de novo and homology-based predictions were used to identify repetitive elements in hadal snailfish and Tanaka's snailfish. First, we constructed a de novo transposable element library using RepeatModeler v1.0.1162, and then used RepeatMasker v4.0.763 to detect repeats. For homologous annotations, the genome sequences were compared with data from Repbase using RepeatMasker v4.0.7 and RepeatProteinMask v1.36 to predict transposable elements. For tandem repeat sequences, we used Tandem Repeats Finder v4.0764 to make predictions.
The repeat masked genome was used for the gene annotation. We used a combination of ab initio gene predictions, homologous gene predictions, and direct gene models produced by transcriptome assembly to identify protein-coding genes structure on the genome as follows.
Step 1: Augustus v3.2.165 was used to generate ab initio predictions with internal gene models.
Step 2: The protein sequences from seven species, medaka, Atlantic cod, flatfish, stickleback, zebrafish, turbot, and fugu (Supplementary Table 6) and the transcriptome predicted protein sequences were used to align genomic sequences with BLAT v. 3566.
Step 3: The psl files obtained in the previous step were integrated and the protein sequences that were aligned to the overlapping region of the genome were scored and sorted based on the alignment results using a custom script to filter out the best aligned protein sequences in this region. Then, GeneWise v2.4.167 was used to predict gene models with the aligned sequences as well as the corresponding query proteins. The custom scripts have deposited in GitHub (https://github.com/wk8910/bio_tools/tree/master/42.prediction).
Step 4: The Evidence Modeler (EVM) v1.1.168 was used to integrate the prediction results with different weights for each.
The integrated gene set was translated into amino acid sequences using InterProScan v569 to annotate motifs and domains in protein sequences by searching publicly available databases (including Pfam, PRINTS, PANTHER, ProDom and SMART) and the genes were further annotated using the KEGG databases.
Variant calling using resequencing data
Short reads of seven Mariana hadal snailfish, one Yap hadal snailfish and five Tanaka's snailfish (Supplementary Table 1) were mapped to the hadal snailfish genome assembled in this study with BWA v0.7.12-r103970; then SAMtools v1.471 was used to sort and obtain BAM files. To analyze population genetics, we focused on SNPs and small indels (1–10 bp)72. The SNPs were called by FreeBayes v0.9.10-3-g47a713e73 with parameters “--gvcf --min-coverage 5 --limit-coverage 200” and filtered by following three thresholds: (1) SNPs with missing rate £ 30%; (2) The highest sequencing depth of SNP position < 200 ×; (3) The lowest sequencing depth for each allele ≥ 5. Subsequently, we calculated the distribution of heterozygosity in genome-wide regions with 500 kb non-overlapping sliding windows.
Inference of phylogeny history
SNP tree, PCA and diversity statistics. PLINK v1.90b6.674 was used to perform PCA and other population divergency statistics, including nucleotide diversity and genetic differentiation (FST). A neighbor-joining tree was constructed with PHYLIP v3.69775 for paired genetic distance matrices.
Admixture analysis. Different K values (from 1 to 5) were tested using Admixture v1.3.076 to infer ancestral populations in all hadal snailfish and Tanaka’s snailfish individuals accessions.
Demographic analysis. The demographic history of hadal snailfish and Tanaka’s snailfish was inferred with pairwise sequential Markovian coalescent (PSMC)77 analysis, based on a substitution rate of 1.9174e-09 per generation for hadal snailfish and 5.6790e-09 per generation for Tanaka’s snailfish. The analysis was performed using the following parameters: −N25 −t15 −r5 −p ‘4+25 × 2+4+6’.
Mitochondrial genome phylogenetic reconstruction and divergence time estimation. The mitochondria of eight hadal snailfish and five Tanaka's snailfish were assembled by NOVOPlasty v4.3.178 with default parameters and annotated by MITOS (http://mitos2.bioinf.uni-leipzig.de/index.py). Subsequently, mitochondrial data from currently published species of the Liparidae were combined (Supplementary Table 12), nucleic acid sequences of 13 coding genes on mitochondria were aligned by MUSCLE v3.8.42579 using default parameters, and alignments of the coding sequences were generated with pal2nal v14 using default parameters. The maximum likelihood (ML) tree was constructed by RAxML-8.2.1280 using the following parameters: -f a -m GTRGAMMA -p 15256 -x 271828 -N 100. Finally, divergence times were estimated by MCMCtree v4.9j81 with 1 soft-bound calibration time-point (snailfish-stickleback: ~32-73 Ma). For Notoliparis kermadecensis, we combined all the above mitochondrial data and performed the same above analysis based on co1 and cytb gene sequences to obtain the ML tree and divergence times.
Gene loss and duplication
Here, we applied an improved read mapping-based method to identify gene loss and duplication, which is effective in reducing false positives and false negatives caused by genome assembly and annotation errors as well as multi-species sequence alignments. The custom scripts have deposited in GitHub (https://github.com/wenjie-xu-nwpu/hadal_snailfish). Although this method may have limitations for identifying gene loss and duplication in species with long divergence times, the divergence times of hadal snailfish and Tanaka's snailfish are about 20 million years7, and at least 88% of the reads in all hadal snailfish individuals can be well compared to Tanaka's snailfish genome, indicating that this method is applicable to this study.
For gene loss, the following methods were used for identification. (1) Short reads of eight hadal snailfish and five Tanaka's snailfish (~30X) were compared to the stickleback and Tanaka’s snailfish genome using BWA v0.7.12-r103970 and subsequently sorted by Samtools v1.471 to obtain the BAM files. (2) We obtained the reads depth for each sites in the gene coding region based on the annotation information of the reference genome, and subsequently classified the depths we had on individual loci into 3 types (“HIGH” for greater than half of the average coverage, “LOW” for less than 3, and “MID” for the rest). We defined sites with “HIGH” for Tanaka's snailfish and “LOW” for hadal snailfish as hadal snailfish-specific lost sites (SLS). Then, the genes with SLSs accounting for at least 40% of the coding sequence length were selected as the candidate specific loss genes. (3)The protein sequences of the genes selected in the previous step were used as a reference to search through the genome of hadal snailfish by BLAT v. 3566 and predict the gene structure using GeneWise v2.4.167 to determine the genes that were completely lost or partially lost in this species. (4) The synteny alignment between the hadal snailfish, Tanaka’s snailfish and stickleback were plotted for partial or fully lost of the gene.
For gene duplication, the following methods were used for identification. (1) Short reads of eight hadal snailfish and five Tanaka's snailfish were compared to the stickleback and Tanaka’s snailfish genome using BWA v0.7.12-r103970 and subsequently sorted by Samtools v1.471 to obtain the BAM files. (2) The homologous sites whose average value of reads depth of all hadal snailfish individuals were greater than 1.5 the average value of the Tanaka’s snailfish individuals were retained and defined as hadal snailfish specific high-copy sites (HCS). Then, the genes with HCSs accounting for at least 50% of the coding sequence length were selected as the candidate high-copy genes. (3) We searched for the location of this gene on the hadal snailfish genome by BLAT v. 3566 and predicted the gene structure by GeneWise v2.4.167 to determine its copy number. (4) Finally, the expansion of this gene was determined by constructing a gene tree of the protein sequences of this gene family from nine species, hadal snailfish, Tanaka’s snailfish, medaka, Atlantic cod, flatfish, stickleback, zebrafish, turbot, and fugu (Supplementary Table 6).
Identification of unitary pseudogenes
Unitary pseudogenes are non-functional genes that decay at their original location82, and we suggest that some missing homologs will exist in hadal snailfish genome as unitary pseudogenes during their adaptation to the special environment of the hadal zone.
We obtained pseudogenes in hadal snailfish by following five steps. (1) Using the stickleback genome sequence as a reference, we performed synteny alignment for three species (hadal snailfish, Tanaka's snailfish and stickleback) with Last v95683 using the parameters ‘-E0.05', generating a total of 382 Mb (of which 290 Mb was informative for all species) of one-to-one alignment sequences with Multiz v184 using the default parameters. (2) Genes with at least 70% of the coding sequences of stickleback or Tanaka’s snailfish present in the MAF and not present in the corresponding regions of hadal snailfish were selected as alternative unitary pseudogene datasets. (3) We used BLAST v2.9.085 to determine if this gene was present in other regions of the hadal snailfish genome. (4) The hadal snailfish corresponding region was extended left and right by 10 KB, and the genes of stickleback and Tanaka's snailfish were used as references for predict the gene structure using GeneWise v2.4.167. (5) Screening for pseudogenes that were consistent in all hadal snailfish individuals.
De novo originated new genes
First, the short reads of eight hadal snailfish and five Tanaka's snailfish were compared to the hadal snailfish genome using BWA v0.7.12-r103970 and subsequently sorted by Samtools v1.471 to obtain the BAM files. In the second step, we defined a single sequenced sample with reads depths < 10 at a single locus as a deletion locus. Based on the annotation file of hadal snailfish, we screened all Tanaka's snailfish individuals for genes with deletions >50%. Next, For the genes specifically present in hadal snailfish selected in the previous step, we used BLAST v2.9.085 to align them with the genomes of eight other fishes (Tanaka’s snailfish, medaka, Atlantic cod, flatfish, stickleback, zebrafish, turbot, and fugu) and screened for genes with a matching region < 0.4. Genes with TPM maxima less than 1 in each tissue of hadal snailfish were filtered out. The fully annotated genes (presence of start and stop codons) in the results were defined as novel genes of hadal snailfish.
Lineage-specific changes in amino acid sequences
For 17 species, Tanaka’s snailfish, stickleback, pacific bluefin tuna medaka, platy fish, Atlantic cod, flatfish, zebrafish, turbot, fugu, spotted gar, coelacanth, chicken, mouse, human, brownbanded bamboo shark and elephant shark (Supplementary Table 6), we identified one-to-one orthologues for each species and hadal snailfish by the RBH method, and subsequently selected genes present in 15 species, including hadal snailfish, as conserved gene sets. Next, the protein sequences of the selected genes were aligned using MAFFT v7.47186, and a custom script was used to select regions that were consistent in other species and had contiguous specificity at sites greater than 3 bp in hadal snailfish, and that had at least 90% sequence identity for each 5 bp region before and after this variant region87. Finally, genes with consistent variants in all hadal snailfish individuals were selected.
We performed protein structure simulation using AlphaFold288 for the amino acid sequences of target genes in hadal snailfish and Tanaka's snailfish. Finally, the highest scoring prediction was selected as the best structure and visualized by UCSF Chimera89.
Comparative transcriptome analysis
For the RNA sequences of hadal snailfish and Tanaka's snailfish, we used Fastp v0.20.060 to filter out low quality and contaminated reads, and then used Hisat2 v 2.1.090 to align them to the respective reference genomes. StringTie v1.3.691 was then used to count the number of reads paired for each gene with the help of gene annotation information of the species, and then TPM values were calculated for each gene in both species. Next, we identified 17281 one-to-one orthologues of hadal snailfish and Tanaka's snailfish using the RBH method. Subsequently, we identified the genes that were differentially expressed (DEGs) between the same tissues of two species using the R package DESeq2 with |log2 (foldchange)| ≥ 1 and corrected P < 0.05. For genes that are up-regulated or down-regulated in multiple tissues, we first found by stochastic simulation that a gene is differentially expressed between two species in one organ does not affect the probability that this gene is differentially expressed in any other organ. Subsequently, we counted the genes that were up- regulated or down-regulated in each tissue to obtain a list of genes that were co-altered in multiple tissues.
Data availability
The sequence data and genome assembly files of the hadal snailfish have been deposited in the NCBI BioProject database with accession numbers PRJNA852951 (genome data) and PRJNA855356 (transcriptome data).