Study site, sampling and physicochemical analyses
The Fankou Pb/Zn sulphidic mine tailings site (25˚2ʹ56.5ʺN, 113˚39ʹ48.5ʺE) is located in Shaoguan, Guangdong province, China. Extremely acidic, heavy metals rich drainage is a persistent feature due to microbially mediated dissolution of sulfide minerals in the tailings at this site. Previous 16S rRNA surveys have documented vertical stratification of geochemistry and prokaryotic populations, with acidophilic archaea, mostly Ferroplasma spp. in the Thermoplasmatales predominant in the upper layers of tailings (oxidized zones and the oxidation front) [26]. Two tailings cores (inner diameter, 8 cm; length, 60 cm) were sampled from an area covered with AMD using a sampling collector in October 2017. After retrieval, the cores were immediately sectioned into distinct layers based on their physical feature and appearance (e.g., colors), yielding six layers for core A and five layers for core B (Additional file 1: Figure S1). Each of the 11 tailings layers was collected in 50 ml sterile tubes, kept in an icebox and transported to the laboratory, where the samples were stored at 4 °C prior to subsequent analyses.
Air-dried subsamples were analyzed with standard methods for the determination of total organic carbon (TOC) (TOC-VCPH; Shimadzu, Columbia, MD), total nitrogen (TN) and phosphorus (TP) (SmartChem; Westco Scientific Instruments Inc., Brookfield, CT). The pH and electrical conductivity (EC) were measured in a 1:2.5 (w/v) aqueous solution using a pH meter and an EC meter. HCl-extractable ferrous iron was determined by the 1, 10-phenanthroline method at 530 nm [27], and sulfate (SO42-) was measured by a BaSO4-based turbidimetric method [28]. Total concentrations of heavy metals (including Pb, Zn, Cu, Cr, Mn, and As) and sulfur (TS) were determined by inductively coupled plasma optical emission spectrometry (ICP-OES; Optima 2100DV, PerkinElmer, Wellesley, MA) and an elemental analyzer (Vario EL, Elementar, Germany), respectively.
DNA extraction and 16S rRNA amplicon and metagenomic sequencing
Total community genomic DNA was extracted using the FastDNA Spin kit (MP Biomedicals, Irvine, CA) according to the manufacturer’s instructions. The V4 region of bacterial and archaeal 16S rRNA genes was amplified with prokaryotic universal primers F515 (5'-GTGCCAGCMGCCGCGGTAA-3') and R806 (5'-GGACTACVSGGGTATCTAAT-3') [29]. A sample-specific 8-bp error-correcting barcode was added to the reverse primer. PCR amplification was conducted in triplicate in 50-µl reaction mixtures following the thermal cycling procedure described previously [30, 31]. Replicate PCR reactions from each sample were pooled and concentrated and purified using a QIAquick Gel Extraction Kit (Qiagen, Chatsworth, CA). A single composite sample was prepared by combining an approximately equimolar amount of PCR product from each tailings sample and then sequenced on an Illumina MiSeq platform (Illumina, San Diego, CA) (250bp, paired end reads). To obtain metagenomic data, extracted DNA was purified using a QIAquick Gel Extraction Kit (Qiagen, Chatsworth, CA), quantified with Qubit (Thermo Fisher Scientific, Australia), and then randomly amplified using IllustraTM GenomiPhiTM V3 DNA Amplification Kit (GE Health Care, United Kingdom). The amplified products were used for library preparation with NEBNext Ultra II DNA Prep Kit (New England Biolabs, Ipswich, MA) and sequenced with MiSeq Reagent Kit v3 on an Illumina MiSeq platform (150bp, paired end reads). Finally, 50-Gigabyte sequence data was obtained for each of the samples.
Processing of 16S rRNA and metagenomic sequence data
16S rRNA raw data were processed and analyzed with the Mothur software package (version 1.38.1) and QIIME (1.9.0) [32, 33]. Briefly, obtained short reads were noise reduced to minimize sequencing error by using the commands of ‘shhh.flows’ and ‘pre.cluster’ in Mothur [32]. Then, putative chimeric sequences were identified and removed by using Chimeric Uchime [34]. Pair-end reads were assembled via the ‘make.contigs’ command, and the primers and barcodes in assembled sequences were removed using the ‘trim.seqs’ commond [32]. Operational taxonomic units (OTUs) were identified by clustering assembled sequences at the 97% similarity level using UCLUST algorithm [34]. Taxonomic classification of the phylotypes was determined based on the Ribosomal Database Project at a default threshold of 80% [35]. Finally, the non-rarified OTU table (table of counts of OTUs on a per-sample basis with singleton OTUs excluded) and OTU taxonomy were converted to a ‘biom’ format to obtain prokaryotic community composition at different taxonomic levels by using the script of ‘summarize_taxa_through_plots.py’ in QIIME [33, 36, 37].
Metagenomic reads were quality filtered and trimmed using in-house Perl scripts [38]. A trim quality threshold of 20 was used and reads containing more than 5 ‘N’ were discarded. All quality-controlled reads from a tailings core were cross-assembled using SPAdes 3.9.0 and kmers of 21, 33, 55, 77, 99, 127 under the ‘--meta’ mode [39]. Genes were predicted by Prodigal 2.6.3 (with the parameters set as “-p meta -g 11 -f gff -q -m -c”) [40], and functional annotation was performed through assignment of predicted proteins to the Pfam 32.0 [41], Kyoto Encyclopedia of Genes and Genomes (KEGG) database [42], and Non-supervised Orthologous Groups (EggNOG v5.0.0) [43]. Briefly, predicted proteins were compared to Pfam database by using the InterProscan 5.0 software with settings of “-appl Pfam -irplookup” and the lowest E-value as the best hits. Additionally, blastp was used to assign viral proteins to KEGG and EggNOG database to get KO and COG terms (E-value: 10-5).
To access the dynamics of individual scaffolds and genes, sequencing reads from each library were mapped onto sequences using Bowtie2 with default parameters [44]. The normalized coverage for a given scaffold or gene was computed as the average scaffold or gene coverage (that is, the number of nucleotides mapped to the scaffold or gene divided by the scaffold or gene length) divided by the number of reads in a given library and multiplied by the mean value of the number of reads in the 11 libraries [5].
Identification and clustering of viral scaffolds
Three methods were applied to identify viral scaffolds in the metagenomic assemblies: (1) viral protein families generated with isolate reference viruses and viral scaffolds identified from a collection of geographically and ecologically diverse samples according to metadata from the Integrated Microbial Genomes with Microbiome Samples (IMG/M) system [45], (2) VirSorter software based on the identification of viral hallmark genes, enrichment in hypothetical proteins and other viral signatures [46], and (3) VirFinder software applying a k-mer frequency based machine learning method [47]. First, viral protein family models were used as a bait to screen metagenomic scaffolds longer than 5 kb and then filtered by inspecting the number of genes covered with viral protein families, Pfams and KO terms, as previously described [45]. Next, metagenomic scaffolds longer than 3kb were processed with VirSorter using the Viromes database [46]. Predicted viral scaffolds in the categories 1 and 2 were then manually curated as described previously [48]. For scaffolds in the categories 4 and 5, only predicted prophage regions were retained. Then, VirFinder was applied to search all scaffolds longer than 1kb, q-values were computed for the identified viral scaffolds, and the scaffolds having q-values < 0.05 were predicted as viruses. Finally, if the viral scaffolds predicted by viral protein families and VirFinder contain a prophage prediction, these viral scaffolds were removed from the viral sequence pools identified by these two methods before all identified viral scaffolds were merged.
All viral scaffolds were clustered into viral OTUs (vOTUs) using the parameters 95% average nucleotide identity and 85% alignment fraction of the smallest scaffolds [49]. To place the viral scaffolds in the context of known viruses, a gene-content based network analysis was used to cluster viral scaffolds into viral clusters (VCs). Briefly, predicted proteins from viral scaffolds were clustered with predicted proteins from isolate reference viruses in the NCBI database (dsDNA viruses, ssDNA viruses and retroviruses combined) [50] based on all versus-all blastp search with an E-value of 10-3, and protein clusters were defined with the Markov clustering algorithm and processed using vConTACT v.2.0 [51, 52].
Reconstruction of prokaryotic genomes and host prediction of viral scaffolds
All cross-assembled scaffolds longer than 2.5kb were binned using MetaBAT v2.12.1 [53], MaxBin v2.2.2 [54], Abawaca v1.00 (https://github.com/CK7/abawaca), and Concoct v0.4.0 [55] with default parameters, considering tetranucleotide frequencies, scaffolds coverage and GC content, and then the results were combined using DASTool [56]. Bins were further manually curated to obtain high-quality genomes using RefineM v0.0.24 [57]. In detail, the automatic binning methods may separate a “true” genome bin into two or more smaller, separate bins. Bins that shared a similar coverage range, GC content and identical taxonomic classifications as determined by CheckM v1.0.7 [58] were grouped into a single bin. Additionally, scaffolds with incongruent taxonomic classification and incongruent 16S rRNA genes were removed as implemented in RefineM v0.0.24 [57]. The completeness and contamination of genome bins were assessed using CheckM v1.0.7 [58], and genomes estimated to be more than 50% complete and less than 10% contaminated were classified using the genome taxonomy database (GTDB-Tk v0.3.0) [59].
Viral scaffolds were putatively linked to their hosts in silico [60]. Briefly, these linkages were based on (1) shared genomic content between viral scaffolds and host genomes, (2) prophages identified in host genomes, and (3) sequence similarity between spacers in microbial CRISPR regions and in the viral scaffolds. All viral scaffolds were compared to the recovered host genomes (E-value ≤ 10-3, bit score ≥ 50, alignment length ≥ 2.5 kb and identity ≥ 70%) using blastn [4]. Viral sequences identified as prophage were matched to their corresponding host genomes. CRISPR spacers were recovered from metagenomic scaffolds using metaCRT with default parameters [61]. Extracted spacers were compared to viral scaffolds using blastn with thresholds of no mismatches over the whole spacer length and an E-value ≤ 10-10 [1, 4].
Analysis of AMGs
Viral genes predicted by Prodigal [40] were assigned to EggNOG v5.0.0 database [43] using blastp (threshold of 50 for bit score and 10-5 for E-value). Viral AMGs assigned as COG0175 (sulfate reduction) were identified in the viral genomes [62], and then compared to the protein sequences in EggNOG v5.0.0 database [43] (blastp, threshold of 50 for bit score and 10-3 for E-value) to recruit relevant reference sequences (up to 20 for each viral AMG sequence) [4]. These sets of viral AMGs and related protein sequences were then aligned with Muscle v3.8.31 [63] and filtered by TrimAL 1.2rev59 [64] to remove columns comprised of more than 95% gaps. Phylogenetic trees were reconstructed using RAxML (version 8.2.8 with the parameters set as “-f a -m GTRGAMMA -n boot -c 25 -p 12345 -x 12345”) [65]. The resulting newick file with the best tree topology determined as with the best likelihood score was uploaded to iTOL v4 [66] for visualization and formatting.
Statistical analyses
All statistical analyses were implemented with various packages within the statistical program R. Pearson correlations were performed using ‘rcorr’ function (Hmisc package) to assess the relationships between the diversity of viruses, prokaryotes and environmental variables in all samples. Bray–Curtis distances were used to construct the dissimilarity matrices for prokaryotic and viral community structure and function profiles, whereas Euclidean distances were calculated using standardized environmental variables (vegan 2.5-4). Permutational multivariate analysis of variance (‘Adonis’ function; 999 permutations) was used to test for significant differences between classified groups of samples (vegan 2.5-4). Mantel tests were performed to reveal the correlations between the dissimilarity matrices (vegan 2.5-4). Statistical significance of differences in normalized coverage of a given gene or COG between two datasets was determined using non-parametric Wilcoxon t-test (unpaired), with confidence intervals at 99% significance and Benjamini–Hochberg correction (P < 0.05).