The proposed protocol estimates the amount of internal standard (“spike-in”, hereon) to add to biological samples for a desired coverage in the sequencing run. This is achieved by measuring via quantitative PCR (qPCR) the copy number of each marker in a few representative samples and in a spike-in solution. Then, the input amount of spike-in added to the samples is adjusted based on these quantifications.
Sampling and sample processing
A total of twelve soil samples were collected from strawberry fields at the IFAPA experimental station “El Cebollar”, located at Moguer (Huelva, Spain) (37º14′25.4” N 6º48′09.2” W): four different time points, with three replicas at each time, along two consecutive campaigns (Table 1). An auger of ~ 4 cm in diameter was used to sample cores of soil at 5–6 points within the sampling unit. Soil was placed in sterile plastic bags and homogenized manually (Rodriguez-Mena et al. 2022). Samples were lyophilized in the Telstar® LyoQuest freeze-dryer for a minimum of 8h at 0.1 mbar in a chamber at -80ºC, until complete dehydration. Dried soil was disaggregated and homogenized by shaking. A homemade setup was used to stream sieving and weighting efficiently (Fig. 1A), and to remove small stones and other organic structures from the sample. Around .25 mg were weighted using a precision scale. The exact weight was recorded for later normalization of the read count.
Table 1
samples (n = 12) included in the study, from two consecutive agricultural campaigns in strawberry plantations, collected at the beginning and end of the campaigns.
samples | date collected | timepoint | campaign |
A-C | 2019-09-11 | beginning beginning | 1 |
D-F | 2019-10-17 | 1 |
G-I | 2020-05-14 | end | 1 |
J-L | 2021-05-25 | end | 2 |
DNA extraction and quantification of the copy number for the target markers
DNA extractions were done with the DNeasy PowerSoil Kit, following manufacturer’s instructions, in two batches: a first one with a non-spiked subset of samples for qPCRs, and a second one with spiked soil samples for sequencing. Both included a positive control containing only the spike-in (no soil) (Fig. 1B) and blanks of extraction. The spike-in sample consisted of 10 µL of the commercial spike-in (ZymoBIOMICS™ Spike-in Control I # D6320) and 20 µL of a 106 cells/ µL solution of the yeast Yarrowia lipolytica. It was added directly to the homogenization tube from the DNA extraction kit. The bacterial spike-in had cells from the halophilic bacteria Imtechella halotolerans (106 cells/ µL) and Allobacillus halotolerans (106 cells/ µL), unlikely to be found in these agricultural soils. For the fungus, the yeast Yarrowia lipolytica strain CECT 1240 was selected from the Spanish Collection of Type Cultures (www.uv.es) because it is a unicellular organism that can be counted in plates, and it is easy to grow and resuspend in desired concentrations to be used as a cellular spike-in. Yarrowia lipolytica spike-in stock solutions were prepared at concentrations of 106 and 107 ufc/ml (S1). DNA concentrations were measured with the NanoDrop 1000. The number of 16S copies per genome varied in the spiked bacteria: 7 copies/genome in A. halotolerans and 3 copies/genome in I. halotolerans. ITS2 copy number for Y. lipolytica was unknown, so copy estimates were related to the number of cells.
A qPCR was run on twelve DNA extractions corresponding to four selected representative samples (A, D, G and J) that had been extracted in three technical replicas; first batch of DNA extractions with no spike-in. The DNA extract from the spike-in solution was included as a standard. Ribosomal regions from fungi and bacteria commonly used in metabarcoding were selected. For fungi, primers ITS3 (5’ GCATCGATGAAGAACGCAGC 3’) and ITS4R (5’ TCCTCCGCTTATTGATATGC 3’) (White et al. 1990) were selected to amplify the ITS2 region, whereas primers 341F (5’ CCTACGGGNGGCWGCAG 3’) and 805R (5’ GACTACHVGGGTATCTAATCC 3’) (Herlemann et al. 2011) were chosen to target the V3-V4 region of bacterial 16S rRNA. Before the qPCR, all soil DNA extracts were diluted to 1/25 to prevent PCR inhibition (Sidstedt et al. 2020), which had been previously detected in this sample set. qPCRs were performed in final volumes of 10 µL, containing 1x of SensiFAST SYBR, 0.5 µM of each primer, 0.2 mg/mL of BSA, and 3.9 µL of template DNA (12–60 ng). The cycling conditions consisted of an initial denaturation at 95°C for 5 min, followed by 35 cycles of 95°C for 5 s, 58°C for 20 s, and 72°C for 40 s. The plate was read at the end of each extension step. A melting curve analysis was added after the 35 cycles, from 65 ºC to 95 ºC, + 5 ºC / 5 s to verify a unimodal peak in fragment sizes. All samples were included in technical duplicates for the 16S and the ITS primer sets, together with standards for relative quantification. The standard consisted of the DNA extraction from the spike-in in serial dilutions by a factor of 5: 1, 1/5, 1/25, 1/125, 1/625, and 1/3125. The qPCR was carried out on the Bio-Rad CFX Connect Real-Time System, and relative quantification was obtained with the software Bio-Rad CFX Manager 3.1. The number of copies in the samples was estimated from the linear regression of the Ct against the natural logarithm of starting quantities of the known dilutions for the spike-in.
Estimation of the input spike-in to target a given proportion of sequencing reads
The amount of spike-in to add to the soil samples to target a given ratio of the sequencing reads can be approximated from the qPCR results and the relations below (Eqs. 1–6). Locus-specific copies in the DNA extract (C) is the sum of spike-in (Cs1) plus the wild copies (Cw) (Eq. 1); and C is proportional to wild copies (Cw) divided by their proportion in the pool (1-R), where R is the ratio of spike-in copies with respect to the total (Eq. 2):
(Eq. 1)
$$\text{C}={\text{C}}_{\text{s}1}+{\text{C}}_{w}$$
(Eq. 2)
$$\text{C}=\frac{{C}_{w}}{1-\text{R}}$$
Eq. 1 and Eq. 2 can be equalized:
(Eq. 3)
$${\text{C}}_{\text{S}1}+{\text{C}}_{w}=\frac{{\text{C}}_{w}}{1-\text{T}};{\text{C}}_{\text{S}1}=\frac{{\text{C}}_{\text{w}}}{1-\text{R}}-{\text{C}}_{w}$$
The yield from a DNA extraction of a known amount of pure spike-in (no soil) can be estimated via qPCR (output/input). Assuming constant DNA yields across a batch of DNA extractions, this ratio can be used to approximate the number of locus copies in soil samples from the same batch. So, the ratio between the spike-in added to a soil sample (V0) and its estimated copies via qPCR (Cs0), is approximately equal to the ratio between the spike-in needed to add to any other sample (V1) to get the desired number of copies (Cs1). V1 can be adjusted to the desired ratio of spike-in/wild copies the DNA extract:
(Eq. 4)
$$\frac{{\text{V}}_{0}}{{\text{C}}_{\text{s}0}}\approx \frac{{V}_{1}}{{\text{C}}_{{s}_{1}}};{\text{V}}_{1}\approx \frac{{\text{V}}_{0}}{{\text{C}}_{s0}}{C}_{{\text{S}}_{1}}$$
Assuming the above ratio associated with the qPCR remains approximately constant in library preparation plus sequencing, the term Cs1 from Eq. 4 can be substituted with Eq. 3:
(Eq. 5)
$${\text{V}}_{1}\approx \frac{{\text{V}}_{0}}{{\text{C}}_{s0}}\left(\frac{{C}_{w}}{1-R}-{C}_{w}\right)$$
The dilution factors of the DNA templates from the spike-in (D0) and sample (D1) at the time of qPCR quantification often differ. These factors derive mainly from elution volumes during DNA extractions and the preparation of the qPCR reaction. These are incorporated into Eq. 6. In addition, V1 can be normalized to the amount of soil sample used for the DNA extraction (M). So, the final units of V1 refer to the mass of soil:
(Eq. 6)
$${V}_{1}\approx \frac{{V}_{0}}{{C}_{S0}}\left(\frac{{C}_{\text{w}}}{1-\text{R}}-{C}_{w}\right)\frac{{\text{D}}_{0}}{{\text{D}}_{1}M}$$
Summary of parameters (schematic representation in Fig. 1B):
V0: volume of spike-in solution added to pure spike-in DNA extraction [μL].
Cs0: spike-in copies from marker i in pure spike-in DNA extraction [copies].
C: copies from marker i in the DNA extract.
Cw: wild copies from marker i in DNA extract [copies].
Cs1: spike-in copies from marker i in DNA extract [copies].
R: ratio of spike-in copies with respect to the total.
V1: volume of spike-in necessary to get desired R [μL].
D0: dilution factor of spike-in DNA extract in qPCR.
D1: dilution factor of biological sample DNA extract in qPCR.
M: soil mass used in DNA extraction [g]
A ready-to-use template spreadsheet to estimate initial amounts of spike-in from qPCR results (Eq. 6) and a worked example are provided in S2. This template was used to estimate the volume of spike-in to add to the soil samples A-L (Table 1) before DNA extractions.
DNA extraction of samples with spike-in added, library preparation, and sequencing
After these calculations, a new spike-in solution was prepared containing a 2:5 vol/vol of bacterial ZymoResearch spike-in and a 107 cells/mL suspension of the yeast Y. lipolytica. From this solution, 5–7 µL were added to each soil sample in the homogenization tube (first step in the DNA extraction), to a total of twelve soil samples from the same plot in a strawberry plantation from two consecutive years (Table 1). DNA of extraction was carried out as in section 2.1. DNA from two of these samples was extracted in three technical replicas. Two blanks of extraction and a sample containing only the spike-in solution were included. A total of nineteen bacterial 16S and fungal ITS2 amplicon libraries were prepared and sequenced in Illumina NovaSeq PE250, as in Rodriguez-Mena et al. (2022), targeting 100K reads for 16S libraries and 50K reads for fungal libraries.
Determination of sequence variants and taxonomic profiling
Locus-specific primers and Illumina adapters were trimmed with cutadapt 2.1 (Martin 2011). Sequencing reads were further filtered with dada2 (Callahan et al. 2016) in R 4.1.3 (R Core Team 2022), with minimum filtering parameters: filterAndTrim(maxN = 0, maxEE = 3, truncQ = 2, minLen = 50). Bacterial 16S reads were truncated at 220 nt. However, no truncation length was applied to fungal ITS2 sequences, since ITS is very variable in size and biologically relevant variants below the read length had been detected in preliminary analysis. A custom R script was used to compute error rates from binned quality scores of the NovaSeq system (github.com/ErnakovichLab/dada2_ernakovichlab). ASV determination was carried out on forward and reverse reads with dada function, followed by merging paired-end reads and ending with the removal of chimeric sequences (details in github.com/csmiguel/spike-in). ASVs were iteratively clustered into KTUs (K-mer Taxonomic Units), reducing the sparsity of the taxon-abundance matrix and the number of variants into optimal biological relevant units (Liu et al. 2022). A newer beta version of the KTU package (KTU2: https://github.com/poyuliu/KTU2) was used to stream the dada2 workflow in R directly into the clustering algorithm and reduce memory burden. Taxonomic ranks were assigned to each ASV (i.e. KTU) down to “genus” level using idTaxa (Murali et al. 2018) from DECIPHER 2.22.0 package (Wright 2016). For bacterial reads, the SILVA 138 prokaryotic small subunit rRNA taxonomic training set (Quast et al. 2012) was used, whereas for fungal reads I used the UNITE v2020 database (Kõljalg et al. 2020), which contains the eukaryotic nuclear ribosomal ITS region. KTU taxon-abundance matrix (OTU table hereon), sample metadata, and DNA sequences were stored in phyloseq 1.38.0 objects (McMurdie and Holmes 2013) in R. KTUs with no phylum assigned or classified as eukaryotic organelle were discarded. Potential contaminant KTUs in blanks were removed with decontam 1.10.0 (Davis et al. 2018). Low abundant KTUs (relative abundances < 5e-5) and/or present in only one sample were also removed.
Estimation of absolute abundances
At this point in the bioinformatics workflow, the data is compositional. A central goal is to transform data from compositional to quantitative. The amount of spike-in added to each sample and their fraction of reads in the filtered ASVs can be used to normalize counts. A custom R script was used to detect the ASVs of the spiked organisms. Assuming the ratio of spike-in in sequencing reads is similar to its ratio in the sample (Eq. 7), the number of wild (non spike-in) biological units (Cw) can be approximated, and normalized to the initial amount of biological sample (Eq. 8). If N internal standards are used (i…iN) for the same marker, the average Cw (Cwav) can be approximated with Eq. 8bis.
(Eq. 7)
$$\frac{{\text{C}}_{w}}{{\text{C}}_{\text{s}1}}=\frac{{R}_{\text{w}}}{{\text{R}}_{\text{s}}}$$
(Eq. 8)
$${C}_{w}=\frac{{R}_{w}{C}_{s1}}{M{R}_{s}}$$
(Eq. 8 bis)
$${C}_{wav}=\frac{{R}_{w}}{M}{\sum }_{i=1}^{N}\frac{{C}_{{s}_{i}}}{{R}_{{s}_{i}}N}$$
R s , reads in sample classified as spike-in i.
C s , spike-in units in inoculated sample.
R w , reads in sample classified as wild reads (non spike-in).
C w , wild units in sample.
Disregarding CNV, quantitative biological units in the sample (OTUabs) can be calculated as the total estimated counts (Cwav) multiplied by a numeric vector with their proportions (OTUrel) (Eq. 9).
(Eq. 9)
$$\overrightarrow{OT{U}_{abs}}=\overrightarrow{OT{U}_{rel}}.{C}_{{w}_{av}}$$
Quantitative data for bacteria referred to 16S copies per gram of soil, whereas for fungi, units referred to the unknown ITS2 copy number present in Y. lipolytica per gram of soil.
Evaluation of internal standards
The correlation between absolute frequencies estimated from both bacteria spike-ins (I. halotolerans and A. halotolerans) was evaluated with Pearson’s r.
To further evaluate the predictability of the proposed workflow, observed proportions of spike-in in the sequencing reads (R observed) were compared to the expected ones (R expected) from the suspension added to the sample (V1). Observed proportions were calculated dividing spike-in reads by total filtered reads. Expected R’s were calculated from qPCR results after clearing the term R from Eq. 6. They were calculated for each sample independently since its value depends on the exact amount of the spike-in suspension added to the sample and approximate marker wild copies (Cw) guessed from the qPCR. For samples that had not been quantified by qPCR, expected target ratios were computed by extrapolating average Cw from the measured ones. In the case of bacteria, R was split among the two species in the spike-in based on the CNV of the 16S rRNA: 3 copies in I. halotolerans and 7 copies in A. halotolerans. Comparisons were expressed symmetrically around 0 by taking base 2 logarithm of the fold change of observed proportions divided by expected proportions. Log2 ratios of fold change from samples measured by qPCR were compared to the extrapolated ones with t-student tests.
Alpha and beta diversity. Procrustes superimposition
Alpha diversity was assessed with phyloseq::estimate_richness. Absolute abundances of bacterial phyla and fungus classes were visualized with stacked bar plots.
Beta diversity was evaluated with Multidimensional Scaling (MDS) on Bray-Curtis (BC) distances (phyloseq::ordinate) on the OTU-tables with compositional and absolute data of bacteria and fungi. Composite data was transformed by multiplying feature proportions (sum 1) by the geometric mean of absolute counts per sample. The logarithm was applied to both, transformed compositional and quantitative data. Then, symmetric Procrustes analyses with vegan::procrustes (Oksanen et al. 2020) were run on the first two dimensions of the MDS to compare the effect of using compositional versus quantitative abundance in beta diversity. The non-randomness (significance) of the MDS configuration between relative and absolute beta diversities was tested with the vegan::protest function with 999 permutations, and the m12 correlation was reported (Peres-Neto and Jackson 2001).