Sampling and metagenomic sequencing
Marine coastal, cold seep, and hydrothermal sediment samples were acquired from the cruises: the R/V Chuangxin Yi to Bohai Sea (BS) on 18-26th, August, 2018, the submersible Shen Hai Yong Shi and her supporting vessel R/V Tan Suo Yi Hao to Haima (HM) cold seep in South China Sea in May, 2018, and the R/V Atlantis to Guaymas Basin (GB) in 2009, and the submersible Shen Hai Yong Shi and her supporting vessel R/V Tan Suo Yi Hao to Longqi (LQ) hydrothermal vent in Southwest Indian Ocean, respectively. Sampling details for hydrothermal samples from Guaymas Basin described previously15. Samples from the BS were collected using a stainless-steel box-sampler. An 11 cm diameter polyvinyl chloride (PVC) tube with dark-tap sealed 2 cm interval side-holes was inserted into the box-sampler after carefully removing top water to take sediment-core samples, and the sub-sample was taken through the side-hole using a cutoff plastic syringe. Push core sediment samples were collected from three active cold seep sites: background (SY72-5), close to clam (SY70-4), and close to mussel (SY70-5) communities in the HM cold seep area, and dissected into sub-samples every 2 cm (Supplementary Table 1). The backgrounds in the cold seep sampling areas were described previously35. The microbial mat in the LQ hydrothermal vent was sampled with a box connected with a sucking tube. All samples were immediately frozen at -80 ℃ on the ship until DNA extraction in the laboratory. Details of DNA extraction and sequencing for samples from GB were described previously15. DNA was extracted using the DNeasy PowerSoil kit (QIAGEN, Germany) for the samples from BS, and sequenced on an Illumina Xten platform. DNA of samples from HM cold seep and LQ hydrothermal vent were extracted using FastDNA™ SPIN Kit for Soil (MP Biomedicals, USA), and sequenced on an Illumina Novaseq platform.
Metagenomic processing, assembly, and binning
Paired sequences were trimmed and quality controlled using Sickle v1.3336 and assembled using IDBA-UD v1.0.937. Binning of assemblies greater than 2,000 bp from GB samples were described previously15. The assemblies longer than 2,000 bp from the BS samples, HM cold seep samples, and LQ hydrothermal vent were binned in the similar procedures. Briefly, the scaffolds were filtered through VIBRANT v1.2.038. Scaffolds annotated as lytic viruses were removed before binning. The assemblies were binned using the combination of CONCOCT v0.4.039, MetaBAT v2.12.140, MaxBin v2.2.741, and DASTool v1.1.242. The high-quality reads were mapped to the assembly using BWA v0.7.1743 with the BWA-MEM algorithm and default settings. The generated sam file was converted and sorted to bam file using SAMtools v0.1.1944. The resulting bam files for each assembly were summarized using jgi_summarize_bam_contig_depths in MetaBAT to generate the contig depth file. The quality of metagenomic assembled genomes (MAGs) were estimated using CheckM lineage_wf v1.0.545. MAGs with over 50% completeness and 10% contamination were manually refined using mmgenome for MAGs recovered from GB samples and mmgenome2 for MAGs recovered from the rest samples46. In the end, 55 MAGs were picked from over 8,000 MAGs for the downstream analysis with a threshold of >50% completeness and <10% contamination estimated using CheckM45. The predicted genome size was estimated based on the size of MAGs and the percentage of completeness.
Phylogenomic analyses
To define the phylogeny of the MAGs, archaeal and bacterial genomes from representative taxa were downloaded from NCBI as the reference dataset. A set of 37 single-copy, protein-coding housekeeping genes was chosen, extracted, aligned, and concatenated from the MAGs and reference genomes using Phylosift v1.0.147. The concatenated alignment was refined using MAFFT v7.45048 with the setting --maxiterate 1000 --localpair, trimmed using BMGE v1.1249 with the setting -m BLOSUM30 -g 0.5 -b 3, and manually checked. The refined alignment was used to generate a maximum-likelihood tree using RAxML v8.2.450 with the parameters: raxmlHPC-PTHREADS-AVX -m GTRGAMMA -N autoMRE -p 12345 -x 12345. Based on the phylogenetic tree, additional 4 and 2 MAGs downloaded from National Center for Biotechnology Information (NCBI) and Integrated Microbial Genomes & Microbiomes (IMG/M), respectively, were phylogenetically related to the MAGs, and included for further analyses. In addition, the taxonomic information of 61 targeting MAGs (55, 4, and 2 MAGs from this study, NCBI, and IMG/M, respectively), was further determined using GTDB-Tk v1.1.151 with release89. Amino acid identity (AAI) of MAGs was estimated using CompareM (v0.1.2) AAI workflow (‘comparem aai_wf’, https://github.com/dparks1134/CompareM).
The 16S rRNA gene sequences were extracted using Barrnap v0.9 (https://github.com/tseemann/barrnap) with the default settings, aligned, and manually curated in ARB52 with the SILVA SSURef NR99 database (release 138). The alignment was exported to generate a maximum-likelihood tree using IQ-TREE v1.6.1253 with the settings: -bb 1000 -bnni -nt AUTO.
Distribution of five phyla
To identify the distribution of these five phyla across different environments, a homology-based approach was used to search based on 16S rRNA sequences against the IMG/M database54. The threshold to determine the presence of the target groups in the environment was set as 80%, except for one sequence with a higher threshold, of the highest bit score based on the search against the 16S rRNA public assembled metagenomes as of 4th-July, 2020 (Supplementary Table 5). The threshold is higher than the highest bit score based on the search against the RNA public isolate as of 8th-July, 2020 in IMG/M (Supplementary Table 5).
Annotations and metabolic prediction
Gene prediction for the MAGs was performed using Prodigal v2.6.355 with default settings. Predicted genes were annotated using MEBS v1.116, KofamScan v1.3.0 with the e-value cut-off of 1e-556, and further characterized using KAAS (KEGG Automatic Annotation Server) web server57 using the ‘Complete or Draft Genome’ setting with parameters: GHOSTX, custom genome dataset, and BBH assignment method. In addition, the protein domains were determined using InterProScan v5.46-81.058 with the settings: -dp -iprlookup -pa kegg,metacyc,reactome -goterms. The cluster of protein content was performed as previously reported15 using MEBS v1.116 with default setting.
Additionally, the key metabolic genes were searched using custom databases. In brief, peptidases in MAGs were identified using DIAMOND BLSATP v0.9.31.13259 to search against with MEROPS pepunit database60 with the settings: -e 1e-10 --subject-cover 80 --id 5061. Genes encoding for carbohydrate active enzymes (CAZYmes) were identified using the dbCAN standalone tool62 with default thresholds. The localization of identified peptidases and CAZYmes was determined using the command-line version of Psort v3.0 using the option --negative for the MAGs.
Genes encoding for dissimilatory sulfite reductase (DsrAB), sulfide-quinone reductase (SQR), and hydrogenase were further identified using DIAMOND BLSATP v0.9.31.13259 to search against with different custom databases with the thresholds: -e 1e-10 --subject-cover 70 --id 50; -e 1e-10 --subject-cover 50 --id 30; and -e 1e-10 --subject-cover 50 --id 40 for DsrAB, SQR, and hydrogenase genes, respectively. The identified Dsr sequences were separately aligned with reference sequences using MAFFT v7.45048 with the setting --maxiterate 1000 --localpair, and trimmed using BMGE v1.1249 with the setting -m BLOSUM62 -g 0.5 -b 3. The identified SQR sequences were aligned with reference sequences using MAFFT v7.45048 with the -auto option, and trimmed using trimAl v1.2rev5963 with the -gappyout option. All alignments were manually checked, and the short and poorly aligned sequences were removed. The maximum-likelihood trees for dissimilatory sulfite reductase (DsrAB) and sulfide-quinone reductase (SQR) were generated using RAxML v8.2.450 with the parameters: raxmlHPC-PTHREADS-AVX -m GTRGAMMA -N autoMRE -p 12345 -x 12345. The identified hydrogenase sequences were further compared with the annotation based on the assigned KO number, and the web-based hydrogenase classifier64. The final identified hydrogenase sequences with selected references of different types of hydrogenases65 were aligned using ClustalW v2.166, and the Neighbor-Joining tree was generated using MEGA X67 under p-distance model with 1,000 bootstrap. All final trees were visualized using the Interactive Tree Of Life (iTOL) webtool68.
Novel protein analysis
We computed gene family clusters on the combined gene set of the 55 genomes using MMseqs2 with relaxed thresholds: minimum percentage of amino acids identity of 30%, E-value < 1e-3, and minimum sequence coverage of 50% (--min-seq-id 0.3 -c 0.5 --cov-mode 2 --cluster-mode 0). For detecting families with no homologs in reference databases, we mapped i) the protein sequences of the 55 genomes against EggNOG using eggNOG-mapper v2 (hits with an e-value < 1e-3 were considered as significant) ii) the protein sequences of the 55 genomes against PFamA using HMMER (hits with an E-value < 1e-5 were considered as significant), iii) the protein sequences of the 55 genomes against PFamB using HMMER (hits with an E-value < 1e-5 were considered as significant) and iv) the CDS sequences of the 55 genomes against Refseq using diamond blastx (sensitive flag, hits with an E-value < 1e-3 and query coverage > 50% were considered as significant). We only considered as novel families those with no significant homology in any of these databases.
For addressing the taxonomic breadth of the novel families, we mapped the longest sequence of each family against a collection of the 169,484 genomes coming from diverse sequencing efforts69–76 used for describing novel lineages across the prokaryotic phylogeny8 using diamond blastp (sensitive flag, hits with an E-value < 1e-3 and query coverage > 50% were considered as significant). We expanded each family with the hits in this database. We subsequently ran Multiple Sequence Alignments for each gene family using Clustal Omega; and reconstructed their phylogeny with FastTree2. We considered a novel family to be present in the novel gene family collection described in Rodriguez del Rio et al8 , built on the 169,484 genomes, if more than 90% of their members were homologous.
We later reconstructed the genomic context of the extended novel families. We built a database including the positions of all the genes in each scaffold. For each of our final extended novel protein families, we calculated a functional conservation score of the genes in a +/- 3 window. For such a purpose, we measured the vertical conservation of each EggNOG Orthologous group (OG), KEGG pathway, KEGG orthology, KEGG module and PFAM in each position (number of genes with a functional annotation / number of genes in the family).
We also calculated the taxonomic dispersion of each novel protein family. Specifically, for each lineage in which a family was detected, we measured the coverage (number of genomes from the lineage in the family / total number of genomes from the lineage in the database) and specificity (number of genomes from the lineage in the family / total number of genomes in the family) of the family. To determine the number of novel families in other prokaryotic lineages, we followed the same strategy as for calculating novel families within the 55 genomes of this study. We built protein families on the proteomes of 169,632 prokaryotic genomes69–76 with mmseqs and mapped them against eggNOG, pfamA and B and RefSeq. Families with no significant hits to any of these databases were considered novel.
Data availability
All sequence data and sample information are available at NCBI under BioProject ID PRJNA692327 and PRJNA362212 (Guaymas Basin), PRJNA743900 (Bohai Sea), PRJNA819461 (Haima cold seep), and PRJNA819455 (Southwest Indian Ocean). Accession numbers for individual genomes can be found in Supplementary Table 2.