Sample collection, data quality control and verification of microbial content
We collected samples from coastal regions of both the Atlantic Ocean (west part of the English Channel – Roscoff, France, August 2017) and the south part of the North Sea (Wassenaarseslag, the Netherlands, July 2017 and August 2018). From here on, we refer to these as samples 1, 2 and 3, respectively. MinION 48-hour sequencing runs on every sample resulted in three datasets, particularly for sample 1 data statistics appear relatively suboptimal compared to data from laboratory cultures (Fig. 1A). We used the top 3 longest reads to assess data quality (additional file 1) and used 16S rRNA primers to confirm microbial DNA isolates (additional file 2).
Read length and quality distributions of MinION sequencing runs
Seawater characterization using k-mer classification
Using OneCodex [26] we generated classification trees for the three datasets. These are built from raw sequencing data and indicate the taxonomic relation between the detected microbial classes. This relation is based on taxonomic identifiers (taxids) provided by the NCBI taxonomy database.
Despite the fact that a large part of all three datasets could not be classified (47%, 69% and 38% for sample 1, 2 and 3, respectively) (additional file 7), all taxonomic trees highlight the complexity of microbial communities present at a single site. None of our three datasets reveal an overall dominant species, the largest differences between samples appear at low abundances. However 4.46% (sample 1), 15.66% (sample 2) and 7.82% (sample 3) of classified reads belong to Planktomarina temperata (Fig. 1B and additional file 3, red node), which is therefore the most abundant species present in the three data sets combined. Please revise additional file 3 for more highlights on classification trees of all three samples
Comparison of Onecodex species classification between different locations and time interval and overall identified ranks
The taxonomic levels assigned by OneCodex range from kingdom down to species-specific. Reads that cannot be linked to a particular taxonomic level are labelled ‘no rank’. In total 1,750, 3,017 and 2,007 taxids are assigned to the data of sample 1, 2 and 3, respectively. More than half of the ranks that OneCodex was able to classify are assigned to species level (Fig. 2B) in all three samples.
Interestingly, at least 484 microbes are identified in all samples (Fig. 2A). Some highlights include: 92 different Flavobacteriaceae bacterium and Flavobacteriales bacterium strains; 19 different Candidatus Pelagibacter strains; 18 Pelagibacteraceae bacterium and 6 SAR strains. This indicates that these communities are less time and location dependent compared to the 262 and 1,127 species that were found exclusively in France or Dutch areas, respectively. Furthermore, 607 and 129 species are exclusively observed in the Netherlands. As they exist at different times, they provide an initial impression of the time-dependent dynamics of these local communities. Finally, 135 and 77 species could be identified that are present at both locations, however only detectable at particular times. This could be an indication that even over large areas microbes are subject to time regulated dynamics.
Metagenomics assembly on raw sequencing data and blast verification on the top-3 longest contigs
In an attempt to verify OneCodex classification results as well as to assess the current metagenomics assemblers capabilities we subsequently assembled the three datasets separately. We have assembled our complex metagenomics datasets with Flye and retrieved 256, 1,735 and 968 contigs with mean coverage of 14x, 13x and 10x from samples 1, 2 and 3, respectively (Table 1). Notably, although it has higher coverage, assembly results from sample 2 did not exceed results from sample 3. On the contrary, sample 3 resulted better average contig length, maximum contig length and N50 values compared to sample 2 (Table 1).
Table 1
Assembly stats | France (1) | The Netherlands ’17 (2) | The Netherlands ’18 (3) |
contigs | 256 | 1,735 | 968 |
length (bp) | 8,678,102 | 107,863,873 | 94,117,952 |
min length (bp) | 2,432 | 536 | 494 |
mean length (bp) | 33,898 | 62,169 | 97,229 |
max length (bp) | 219,363 | 1,098,797 | 1,648,106 |
N50 | 40,621 | 75,928 | 153,524 |
Impressively, Flye was able to reconstruct a full genome from our third sample: 75% of our 1.6 Mbp contig aligns with 80% identity to Candidatus Thioglobus singularis of which its complete genome is a single circular chromosome of 1.7 Mbp (additional file 4). Additionally we show that OneCodex was able to identify certain species only using assembly results (additional file 5).
Data quality of unclassified reads and additional in silico PCR analysis
Poor read quality and relatively short read lengths could be a potential reason explaining why OneCodex was unable to classify taxids. Therefore, we investigated quality and length of unclassified reads (additional file 6). These statistics indicate that, in theory, these reads should provide OneCodex with sufficient information to resolve classifications. That OneCodex was not able to classify these reads, even to the most general taxonomic levels (such as kingdom or phylum) adds to the notion that these reads originate from species that are novel.
Inspection of low complexity regions in unclassified reads using tandem repeat analysis
An additional circumstance that might explain why reads are left unclassified is the presence of low complexity regions such as repeat elements. We have analysed the presence of repeat elements with Tandem Repeat Finder [22] in raw sequencing data and compared these to repeat counts of the unclassified reads. In none of our samples did we observe an increased presence of repetitive elements, on the contrary, the repetitive element count is lowered in every case (additional file 8).