Site description. We analyzed bacterial isolates from sediment and water of nine ponds in the Pozas Rojas area of CCB (Figure 1). This site is composed of several small ponds (locally called pozas) that surround a larger pond in the system of Los Hundidos [30, 35]. These small ponds become hypersaline in summer [30], and used to have the highest stoichiometric unbalance (i.e., lowest P concentration) reported in CCB (C:N:P 15820:157:1)[35]. The ponds have seasonal high fluctuations in temperature (around 1 °C in winter to up to 60 °C in some summer moments in some cases)[35] and are small but permanent, separated from each other by ca. 9 meters or more, along an arch around the larger pond. However, the Pozas Rojas were flooded by hurricane Alex during summer 2010, merging most of the small ponds into a single large pond, until autumn 2011, when the water receded, leaving the moon shaped array of small red ponds at the same place (Figure 1).
Sample collection and strains isolation. We collected water and sediment samples in duplicate from nine ponds located in Pozas Rojas, Los Hundidos, CCB, during March 2013 and stored them at 4 °C until processing. Sediment was collected for nutrient analysis in 50 ml Falcon tubes and covered with aluminum foil before storage. Water was collected for nutrient quantification in 1 liter volumes and stored in the dark at 4 °C. Chemical analyses were performed at the Instituto de Investigaciones en Ecosistemas y Sustentabilidad, UNAM, in Morelia, Mexico. Cultivable strains from both sediment and water were isolated in PIA (Pseudomonas isolation agar) and TCBS (Thiosulfate Citrate Bile Sucrose Agar) as previously described [52, 65], obtaining a total of 174 isolates, being 88 isolates from sediment and 86 from water.
Environmental variables measurement. For nutrient quantification, sediment samples were dried, and water samples were filtered through a Millipore 0.42 µm filter. Total carbon (TC) and inorganic carbon (IC) were determined by combustion and colorimetric detection [66] using a total carbon analyzer (UIC model CM5012, Chicago, USA). Total organic carbon (TOC) was calculated as the difference between TC and IC. For total N (TN) and total P (TP) determination, samples were acid digested with H2SO4, H2O2, K2SO4 and CuSO4 at 360°C. Soil N was determined by the macro-Kjeldahl method [67], while P was determined by the molybdate colorimetric method following ascorbic acid reduction [68]. The N and P forms analyzed were determined colorimetrically in a Bran-Luebbe Auto analyzer 3 (Norderstedt, Germany).
DNA Extraction and PCR Amplification of 16S rRNA. For the 174 isolates obtained, DNA extraction was performed as described by Aljanabi and Martinez (1997) [69]. 16S rRNA genes were amplified using universal primers 27F (5′-AGA GTT TGA TCC TGG CTC AG-3′) and 1492R (5′-GGT TAC CTT GTT ACG ACT T-3′) [70]. All reactions were carried out in an Applied Biosystems Veriti 96 Well Thermal cycler (California, USA) using an Amplificasa DNA polymerase (BioTecMol, Mexico) with the following program: 94°C for 5 min, followed by 30 cycles consisting of 94°C for 1 min, 50°C for 30 s, 72°C for 1 min and 72°C for 5 min. Polymerase chain reaction (PCR) amplification products were electrophoresed on 1% agarose gels. Sanger sequencing was performed at the University of Washington High-Throughput Genomics Center.
Phylogenetic analysis of 16S rRNA sequences. The first 700 bps of the 16S rRNA gene, were aligned with Clustalw [71] and quality control was performed with Mothur [72]. Genera level identification was made using the classifier tool [73] from the Ribosomal Database Project (RDP) Release 11.4 [74] (Additional file 1: Table 3). Blastn searches were performed against Refseq database from NCBI to select reference sequences. A total of 101 sequences were identified as members of the Vibrionaceae family, 41 were isolates from water and 60 from sediment. These isolates were used in subsequent analyses. A maximum likelihood phylogenetic reconstruction was obtained with PhyML version 3.0 [75], using the HKY+I+G substitution model estimated with jModelTest 2 [76]. The degree of support for the branches was determined with 1,000 bootstrap iterations.
Environmental association of phylogroups. To test whether the community of cultivable strains was structured based on its isolation environment (i.e., water or sediment), we performed an AdaptML analysis [37], including our 101 isolates belonging to Vibrionaceae and an Halomonas spp. strain as an out-group. Three categorical environmental variables were tested, including pond of isolation, high and low nutrient concentrations, and the two sampled environments (water or sediment).
Genome sequencing, assembly, and annotation. For whole-genome sequencing, we selected from the AdaptML analysis 39 Vibrio spp. isolates, 23 isolated from sediment and 16 from water, plus 3 isolates of Photobacterium spp. (a lineage closely related to the Vibrio spp. genus) isolated from sediment. DNA extractions were performed with the DNeasy Blood and Tissue kit (Qiagen).
Sequencing was performed with Illumina MiSeq 2x250 technology, with insert libraries of 650 bps and an expected coverage of ca.10x per genome. At first, we planned an assembly strategy using a genome reference; for this reason, the strain V15_P4S5T153 had a second library that was designed using the Jr 454 Roche technology, in order to reduce sequencing bias and get higher coverage. However, due to divergence among genomes, we performed de novo assemblies for all genomes. All sequencing was performed at the Laboratorio Nacional de Genómica para la Biodiversidad (LANGEBIO), México.
The quality of raw reads was analyzed using FASTQC software (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/). A minimum quality value of 25 was set, and low-quality sequences were removed with fastq_quality_filter from the FASTX-Toolkit (http://hannonlab.cshl.edu/fastx_toolkit/index.html). Adapter sequences were identified, removed and paired-end reads were merged using SeqPrep (https://github.com/jstjohn/SeqPrep). De novo assemblies were performed with Newbler (Roche/ 454 Life Sciences) using both single-end and merged reads.
For scaffolding process, we used SSPACE [77], gaps were closed using GapFiller [78] and final error correction was performed with iCORN [79] (Additional file 2: Table 10). Coding sequences were inferred with Prodigal 2.0 [80] implemented in PROKKA software [81]. InterProScan 5 allowed annotation [82] with the databases enabled by default. Genome completeness was assessed with BUSCO using the Gamma-proteobacteria database [38].
Pan-genome analyses. The 42 genomes from CCB where compared with genomes of 5 reference Vibrio spp. strains: Vibrio alginolyticus NBRC 15630 = ATCC 17749, V. anguillarum 775, V. furnissii NCTC 11218, V. parahaemolyticus BB22OP and V. metschnikovii CIP 69 14 (Additional file 1: Tables 4, 5; Additional file 2: Tables 10). Ortholog gene families were predicted from all 47 genomes using the DeNoGAP comparative genomics pipeline [83]. To minimize false positive prediction of orthologs, we assigned Photobacterium spp. genomes as outgroup. The completely sequenced genome of V. anguillarum strain 775 was used as seed reference.
We estimated the core genome based on presence and absence of gene families across the genomes. If the genes were present in all strains, the orthologs were classified as core, while genes were classified as accessory when present in more than one strain but not in all of them, and unique genes when it was present only in a single strain. Since most of the genomes in our dataset are not completely sequenced, we designated core ortholog families as those present in at least 95% of the genomes, to avoid the impact of missing genes due to sequencing or assembly artifacts.
The package Micropan [84] within R v.3.4 (R Core Team) [85] was used to infer the open or closed nature of each pan-genome dataset, following the heaps law proposed by Tettelin et al. [42]. The Heaps law model is fitted to the number of new gene clusters observed when genomes are ordered randomly. The model has two parameters: an intercept, and a decay parameter called alpha. If alpha is higher than 1.0 the pan-genome is considered closed, if alpha is lower than 1.0 it is considered open. Additionally, a random sub-sampling for each clade was made, taking three genomes and calculating the alpha value for each group of three genomes. A total of 1,000 independent sub-sampling events were made for each clade.
Core proteins were aligned using Kalign [86] to infer the phylogenetic relationship between the samples. The resulting alignments of individual ortholog families were concatenated using a custom Perl script. With these concatenated core genes, a maximum likelihood phylogenetic tree was constructed using the FastTree program [87].
Recombination analyses. Of the total ortholog families in the Vibrio spp. pan-genome, we only used the ortholog families found in at least three genomes for the recombination analyses. Genetic recombination was examined on each coding sequence (CDS) alignment by using inference of pairwise recombination sites, obtained with GENECONV [88] and by the identification of putative recombinant sequences through breakpoints using GARD [89].
Based on the number of recombination events, we estimated the events shared among isolates of the same pond and environment, among isolates of the different pond and environment, among isolates of the same pond and different environment and among isolates of different pond and environment. For this, we normalized the data by pan-genome size, number of strains and branch length. Given that the large generalist Clade II presented a clear sub-structure, we did a separated analysis for the shorter branches within Clade II (Additional file 1: Figure 4).
To assess the impact of homologous recombination, we analyzed the substitution pattern using two different algorithms, Gubbins [90] and ClonalFrameML [43]. A whole-genome alignment for the 47 analyzed genomes was performed with MAUVE [91]. The resulting alignment was used as input for Gubbins [90] using RAxML [92] and default parameters. Additionally, whole genome alignments were performed for each clade, excluding references, with the progressive MAUVE algorithm [91]. We calculated the R/theta ratio, nu and delta [43] for each sample and for 100 bootstrapped replicates.
Genetic structure of Clade II. Recombination analyses showed that in Clade II there are internal groups with higher internal recombination, so we decided to further investigate the structure within Clade II. For clustering analyses, we used Nei’s genetic distance [93] and Neighbor Joining. Genomes with distance less than 0.001 were grouped and tested with a discriminant analysis of principal components of the genetic variation, using the adegenet library in R [94]. For this study, we used 20 principal components and 3 discriminant functions.
Selection analyses. We used FUBAR [95] to identify signatures of positive selection among ortholog gene families found in at least three genomes. We accounted for recombination breakpoints in the ortholog families, while calculating positively selected sites based on GARD results [89]. We considered any site to be positively selected if it showed P-value < 0.05. We also conducted a Gene Ontology (GO) enrichment analysis using topGO [45] to find overrepresented biological functions in this set of genes.
Effective population size estimation. We followed a simulation approach to estimate the posterior distribution of the effective population size (Ne) of each of the six clades. According to the previous clustering and recombination analysis, for Clades I, III, IV, V and VI we simulated a single population, while for Clade II we simulated three sub-populations that diverged from an ancestral population.
Simulations were performed using Fastsimcoal2 [44, 96]. For each clade, we simulated DNA sequences having a similar length equal to the number of nucleotides in the given clade, as well as a sample size equal to the number of sequences sampled for each clade. We assumed no recombination within the genome, and used the Escherichia coli mutation rate of 2.2x10-10 mutations per nucleotide per generation [97]. We ran between two and four simulations for each clade. For the initial runs, we generated 100,000 replicates extracting Ne values from a prior log-uniform distribution that ranged from 100,000 to 20,000,000 individuals. For Clade II, we also estimated the age of divergence of each Sub-clade, by setting the prior distribution of time ranging from 1,000-4,000,000 generations. After a first run, we narrowed the prior ranges based on those simulations that had similar summary statistics compared to the observed data and performed another 100,000 simulations using the narrowed priors.
To compare the previously simulated and observed data based on summary statistics, we used the ape [98] and pegas [99] libraries in R to estimate the number of polymorphic sites and the Tajima’s D based on the entire genomes. Tajima’s D is commonly used to estimate demographic changes in populations [100, 101]. Also, we obtained 1,000 sliding windows frames to estimate the Tajima’s D along the genomes, as well as the mean and standard deviation of Tajima’s D. Tajima’s D, π, and Watterson’s theta (θw) were estimated for each clade as well as for Sub-clades A, B and G. Since clades I and VI had three sequences and it was not possible to obtain Tajima’s D, we did 1,000 replicates in which we subsampled with replacement 10 sequences. For each replicate, we calculated Tajima’s D and we obtained as the proximate value the median estimated across the 1,000 replicates.
Based on the summary statistics, we used the abc function in the ABC package [102] in R to calculate the distribution of the Ne parameter based on a 0.05 % threshold distance between the simulated and observed data. For each clade, we report the median and the 95% interval confidence of Ne. For Clade II, we further reported the average and 95% interval confidence of the number of generations since each Sub-clade diverged from an ancestral clade.
Association between genotypes and environmental variables. We evaluated whether the genetic variation within the Vibrionaceae genomes could be explained by particular adaptations to the environment (water or sediment). We used progressiveMauve [91] to perform a global multiple alignment between the assembled genomes. We extracted the variant sites within the alignment and exported them as SNPs using snp-sites [103].
We obtained 38,533 SNPs, which we used to search for private alleles using Poppr [104]. Afterwards, we obtained a subset of 25,892 SNPs by filtering biallelic sites with minor allele frequencies > 0.05. We used PLINK [105] to perform a GWAS to detect possible associations between our SNP set and either the water or sediment environments. We conducted Fisher exact tests and regarded as significant all SNPs whose associations had p-values < 0.01 after Bonferroni corrections. These analyses may be informative even considering these sampling differences [106, 107].
To test whether these associations could be explained by convergent evolution rather than by common ancestry, we compared an UPGMA tree reconstructed from the total set of SNPs from an UPGMA tree using only the SNPs that were significantly associated to the environment. We analyzed the distribution of the SNPs within the genomes to find the genes associated to those SNPs.
We mapped the SNPs positions in the genome alignment moving by 1 Kb windows; this window size was selected considering the average bacterial gene size and retrieved all the associated genes. We conducted a Gene Ontology (GO) enrichment analysis using topGO [45] to find overrepresented biological functions in this set of genes.