Analysis protocol: filtering using UMI-based error rates
The resulting paired-end reads were merged and then separated by gene prior to downstream analysis, in which UMIs are critical in two ways. Firstly, sequences are clustered by UMI, the most abundant sequence associated with each UMI is selected, and the number of unique UMIs is counted for each distinct sequence (Fig. 1C). UMIs are discarded as ambiguous if the most abundant sequence does not have at least two reads more than the next in abundance. The most abundant sequence will usually be the correct one (Fig. 2A Case 1) but, because most UMIs are represented by just a small number of reads, an erroneous sequence can sometimes be sampled more often than the true sequence, so that the primary sequence of the UMI becomes this erroneous sequence (Fig. 2A Case 2). Secondly, we reasoned that it may be possible to eliminate these errors by using the UMIs to provide information on global error rates across all samples. We implemented this in MAUI-seq by noting both the most abundant (primary) and the second most abundant (secondary) sequence whenever two or more sequences were associated with the same UMI. MAUI-seq then distinguishes between true and erroneous sequences based on the ratio of secondary to primary occurrences of each sequence, eliminating sequences for which this ratio is high (default threshold 0.7; Fig. 1C and Fig. 2B). The 0.7 threshold was chosen empirically, based on the ratios observed for known true and erroneous sequences, but it is a compromise because the incidence of secondary sequences varies across genes and studies. Examination of the results may suggest different thresholds for other studies. Finally, globally rare sequences are discarded (the default threshold is 0.1% averaged across samples; a lower threshold could be used if samples were sequenced to a greater depth). Python scripts for separating the genes and for the UMI analysis are available at https://github.com/jpwyoung/MAUI.
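The UMI-based filtering described above can be illustrated with a minimal Python sketch. This is not the code from the MAUI repository: the function name `maui_filter`, its parameters, and the input format (an iterable of `(umi, sequence)` pairs for a single gene) are assumptions made for illustration. The sketch further assumes that a secondary occurrence is recorded only for UMIs that pass the two-read lead requirement, and that a sequence is rejected when its secondary/primary ratio exceeds the threshold; the text does not specify either detail.

```python
from collections import Counter, defaultdict


def maui_filter(reads, min_lead=2, ratio_threshold=0.7):
    """Illustrative UMI filter (hypothetical helper, not the MAUI-seq script).

    reads: iterable of (umi, sequence) pairs for one gene.
    Returns (primary, secondary, accepted), where primary and secondary are
    Counters of UMI-level occurrences and accepted maps each retained
    sequence to its primary UMI count.
    """
    # Count reads of each sequence within each UMI.
    per_umi = defaultdict(Counter)
    for umi, seq in reads:
        per_umi[umi][seq] += 1

    primary = Counter()
    secondary = Counter()
    for counts in per_umi.values():
        ranked = counts.most_common(2)
        top_seq, top_n = ranked[0]
        next_n = ranked[1][1] if len(ranked) > 1 else 0
        # Ambiguous UMI: the leader must have at least min_lead more reads
        # than the runner-up. Single-read UMIs fail this test as well.
        if top_n - next_n < min_lead:
            continue
        primary[top_seq] += 1
        if len(ranked) > 1:
            # Assumption: secondaries are noted only for unambiguous UMIs.
            secondary[ranked[1][0]] += 1

    # Reject sequences that occur too often as secondary relative to primary.
    accepted = {
        seq: n for seq, n in primary.items()
        if secondary[seq] / n <= ratio_threshold
    }
    return primary, secondary, accepted


# Toy data: "TRUE" is a genuine sequence, "ERR" an error derived from it.
example_reads = (
    [("u1", "TRUE")] * 3 + [("u1", "ERR")]
    + [("u2", "TRUE")] * 3
    + [("u3", "ERR")] * 3 + [("u3", "TRUE")]
    + [("u4", "TRUE")] * 2 + [("u4", "ERR")]      # lead of 1: ambiguous
    + [("u5", "TRUE")] * 4 + [("u5", "ERR")] * 2
)
primary, secondary, accepted = maui_filter(example_reads)
print(dict(primary), dict(secondary), accepted)
```

In this toy data set "ERR" becomes the primary sequence of one UMI (the Case 2 situation), but it appears as a secondary sequence twice, so its ratio is far above 0.7 and it is removed, whereas "TRUE" is kept with a ratio of 1/3.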
Validation using purified DNA mixed in known proportions
We first evaluated the accuracy of MAUI-seq by profiling DNA mixtures with known strain DNA ratios. DNA was extracted from two Rlt strains differing by a minimum of 3 bp in each of their recA, rpoB, nodA, and nodD amplicon sequences, and the extracted DNA was mixed in different ratios (Supplementary Table S1). After amplification and sequencing, assembled reads were assigned to their target gene and analysed using MAUI-seq and two programs frequently used for de-noising of amplicon sequencing data, DADA2 and UNOISE3 [19, 21]. Since rare sequences have a high error rate, we discarded (for each of the three methods) sequences that fell below a threshold frequency of 0.1% of accepted sequences. The observed and expected strain ratios were highly correlated for all four genes across the three analysis methods, and we found that the performances of the proofreading (Phusion) and non-proofreading (Platinum) polymerases were gene-dependent, which could be due to differences in amplification efficiency for the four templates (Table 1 and Supplementary Figures S1-S4). On average, MAUI-seq detected between 98.5% and 100% true sequences exactly matching those of the two strains in the mixture, while DADA2 ranged from 89.7% to 100%, and UNOISE3 from 79.8% to 100% (Table 1). The better performance of MAUI-seq was due to more effective elimination of chimeras, which were especially abundant when the PCR was carried out using the Platinum non-proofreading polymerase (Table 1 and Supplementary Figures S1-S4). For the proofreading polymerase, DADA2 detected 100% true sequences for all four genes, whereas MAUI-seq detected 99.03% for nodA, failing to eliminate three rare sequences that did not have sufficient secondary counts. This suggests that DADA2 performs as well as, or even slightly better than, MAUI-seq when a proofreading polymerase is used to amplify DNA from a simple, two-component mix.
The prevalence of secondary sequences varied with gene and polymerase: the secondary/primary ratio for accepted sequences was 0.0322 for rpoB using Phusion, but just 0.0002 for nodD using Platinum. When the ratio was very low, there were insufficient secondary counts for MAUI-seq to eliminate erroneous sequences effectively.
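The 0.1% abundance cut-off that was applied to the output of all three methods can be sketched as follows. The helper name `drop_rare` and the nested-dictionary input are hypothetical, and the sketch assumes that the threshold is applied to the mean of the per-sample relative frequencies; pooling counts across samples before computing the frequency would be an equally plausible reading of the text.

```python
def drop_rare(counts_by_sample, min_mean_freq=0.001):
    """Keep sequences whose mean relative frequency reaches the threshold.

    counts_by_sample: {sample_name: {sequence: count}} (hypothetical format).
    Returns {sequence: mean relative frequency} for the retained sequences.
    """
    # Ignore empty samples so that they cannot cause a division by zero.
    samples = [c for c in counts_by_sample.values() if sum(c.values()) > 0]
    if not samples:
        return {}

    sequences = set().union(*samples)
    mean_freq = {}
    for seq in sequences:
        # A sequence absent from a sample contributes a frequency of zero.
        freqs = [c.get(seq, 0) / sum(c.values()) for c in samples]
        mean_freq[seq] = sum(freqs) / len(freqs)

    return {seq: f for seq, f in mean_freq.items() if f >= min_mean_freq}


toy_counts = {
    "sample1": {"A": 900, "B": 99, "C": 1},
    "sample2": {"A": 500, "B": 500},
}
kept = drop_rare(toy_counts)
print(sorted(kept))
```

Here sequence "C" reaches 0.1% in one sample but only 0.05% on average, so it is discarded under the averaging assumption.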
Validation using environmental samples
To test the method on more complex samples, we compared Rlt populations in root nodules from two locations in Denmark, a clover trial station in Store Heddinge on Zealand and a lawn at Aarhus University in Jutland (the Field-Samples-1 dataset; Supplementary Figure S5). One hundred nodules were pooled for each sample and each plot was sampled in four replicates. Platinum Taq polymerase enzyme was used for amplification. Each clover root nodule is usually colonised by a single Rhizobium strain, so a maximum of 100 unique sequences per gene is expected per sample.
For Field-Samples-1, the total number of distinct sequences for MAUI-seq and DADA2 was in the same range as the number of distinct alleles observed in a population of 196 natural European Rlt isolates (Table 2). In contrast, UNOISE3 produced a substantially higher number of distinct sequences, suggesting that its default filtering might be too lenient for our data (Table 2). The sequences accepted as true by MAUI-seq were nearly all also included in the DADA2 and UNOISE3 outputs (Fig. 3). On the other hand, DADA2 and UNOISE3 both accepted a number of sequences that were filtered out by MAUI-seq, and many of these were eliminated by MAUI-seq because a high ratio of secondary to primary occurrences strongly suggested that they represent errors and not real sequences (Fig. 3 and Additional file 2). To provide independent evidence as to whether sequences were likely to be genuine, we checked whether they matched (or differed by a single nucleotide from) known sequences in either a reference database of 196 natural European Rlt isolates or the NCBI whole-genome shotgun database (Fig. 3). The great majority of sequences rejected by MAUI-seq did not have exact matches to these known sequences. A few sequences that exactly matched known alleles were included by DADA2 and UNOISE3, but not by MAUI-seq. These sequences were not reported by MAUI-seq because their UMI counts were below the abundance threshold, not because the secondary/primary occurrence filter identified them as erroneous (Fig. 3). The count threshold could be lowered to include rarer sequences, if the study required it.
The allele frequency distributions were different at Aarhus and Store Heddinge (Fig. 3), and the two sites were clearly separated by the first principal component in a principal component analysis (PCA) for MAUI-seq, DADA2 and UNOISE3 sequences (Fig. 4 and Supplementary Figures S6-S8). The amplicon sequencing has sufficient resolution to characterize geospatial variation in allele frequencies. For example, MAUI-seq, DADA2 and UNOISE3 can all clearly identify several highly abundant sequences from one location that are either absent or present at very low frequency in samples from the other location (Fig. 3). To quantify the genetic differentiation between the Aarhus and Store Heddinge sites, we calculated fixation indices (FST). Considering all four target genes combined, the MAUI-seq output resulted in the highest FST value, followed by DADA2 and UNOISE3 (Table 2, Fig. 4 and Supplementary Figures S9-S11). For all individual genes, MAUI-seq also produced the highest FST estimates, and the differences were especially pronounced for nodA, which also showed the highest overall level of differentiation (Table 2 and Supplementary Figures S9-S11). The lower genetic differentiation estimated from the DADA2 and UNOISE3 results, compared to those of MAUI-seq, reflects the inclusion of an increased number of erroneous sequences, which are less differentiated between the two sampled sites than the real sequences (Fig. 3).
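To make the effect of erroneous sequences on FST concrete, the following Python sketch computes a simple heterozygosity-based fixation index, FST = (HT - HS) / HT, from the allele frequencies at two sites. The text does not state which FST estimator was used, so this formula, the equal weighting of the two sites, the absence of any sample-size correction, and the function name `fst_two_sites` are all assumptions made for illustration.

```python
def fst_two_sites(freqs_a, freqs_b):
    """Heterozygosity-based FST for two sites (illustrative estimator only).

    freqs_a, freqs_b: {allele: relative frequency}, each summing to 1.
    """
    alleles = set(freqs_a) | set(freqs_b)

    # Expected heterozygosity (gene diversity) within each site.
    h_a = 1.0 - sum(p * p for p in freqs_a.values())
    h_b = 1.0 - sum(p * p for p in freqs_b.values())
    h_s = (h_a + h_b) / 2.0

    # Gene diversity of the pooled population, weighting the sites equally.
    h_t = 1.0 - sum(
        ((freqs_a.get(x, 0.0) + freqs_b.get(x, 0.0)) / 2.0) ** 2
        for x in alleles
    )
    return 0.0 if h_t == 0 else (h_t - h_s) / h_t


# Two sites dominated by different alleles.
clean = fst_two_sites({"a1": 0.8, "a2": 0.2}, {"a1": 0.2, "a2": 0.8})

# The same sites with 20% of each replaced by an error shared by both.
noisy = fst_two_sites(
    {"a1": 0.64, "a2": 0.16, "err": 0.2},
    {"a1": 0.16, "a2": 0.64, "err": 0.2},
)
print(round(clean, 3), round(noisy, 3))
```

In this toy example, adding a spurious sequence at equal frequency in both sites lowers the estimate, which is the direction of the effect described above for outputs that retain more erroneous sequences.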
Since it was clear from the DNA mixture experiment that the choice of DNA polymerase could significantly affect error rates, we sampled root nodules from 13 additional clover field plots (the Field-Samples-2 dataset) and amplified each sample (a pool of one hundred root nodules) using Platinum and Phusion polymerases in parallel. For samples amplified using Platinum, MAUI-seq detected fewer sequences than DADA2 and UNOISE3 for the two core genes, but the same number of reference sequences was detected (Table 3). DADA2 included two chimeric sequences that were filtered out by MAUI-seq due to a high ratio of secondary to primary occurrences (Additional File 2). UNOISE3 detected twice as many sequences as DADA2 and MAUI-seq for the accessory genes, but most of the additional sequences had no associated UMIs and were classified as “other” (Table 3, Additional File 2). For samples amplified using Phusion, MAUI-seq and DADA2 detected a similar number of sequences (Table 3). All nine UNOISE3 rpoB sequences that were not accepted by either MAUI-seq or DADA2 (Additional File 2) are putative chimeric sequences with two parental sequences of higher abundance. For nodA, MAUI-seq includes three sequences that have a single nucleotide difference from a reference sequence, but all have a low ratio of secondary to primary reads, so we hypothesise that these are true sequences. Some sequences that matched a reference sequence or had an exact BLAST hit were included by DADA2 but not by MAUI-seq because DADA2 estimated their abundance to be above the 0.001 threshold, whereas MAUI-seq estimated that they were rarer.
Both MAUI-seq and DADA2 identify and remove sequences that appear to be errors (base substitutions or chimeras), but they use completely different evidence. As a result, they do not always make the same decision, as illustrated for a small set of representative data in Table 4 (the rpoB sequences amplified by Phusion). While DADA2 examines the sequences and rejects those that are likely to be generated from more abundant sequences in the sample, MAUI-seq does not use the actual sequence but bases decisions on how frequently a sequence occurs as a secondary sequence with the same UMI as another (primary) sequence. Sequences ranked 5 and 6 (Table 4) are both potential chimeras of the more abundant sequences 1–4. Both DADA2 and MAUI-seq reject sequence 6 and accept sequence 5. Sequence 6 has a secondary/primary ratio of 103/118, which is above the default threshold of 0.7, so MAUI-seq rejects it as a likely error. On the other hand, the ratio for sequence 5 is 71/229. This is well below the threshold, but it is higher than that of other sequences with a similar primary count, e.g. sequence 9 (15/270). A possible explanation is that some of the reads for sequence 5 are generated as chimeras but others are genuine, since it is entirely plausible that new alleles are generated by recombination between existing alleles. To some extent, MAUI-seq compensates for this because it allocates sequence 5 a relatively low count and hence a lower ranking (8) than it has in the raw reads or the DADA2 analysis. There are two further sequences, 10 and 29, that are rejected by DADA2 as potential chimeras but accepted by MAUI-seq (Additional file 2 Field-Samples-2-phusion-rpoB); in both cases they have secondary sequence counts well below the threshold, so MAUI-seq accepts them as genuine. DADA2 included an rpoB sequence that does not have any associated UMIs (sequence 41) and appears to be a chimera of two more abundant sequences (sequence 3/4/5 and sequence 11) (Table 4).
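The MAUI-seq decisions for the three sequences discussed above follow directly from the secondary and primary counts quoted from Table 4. The short Python snippet below simply reproduces that arithmetic; the variable names are illustrative, and the comparison is assumed to be "reject if the ratio exceeds 0.7".

```python
THRESHOLD = 0.7  # default secondary/primary threshold quoted in the text

# (secondary count, primary count) for the Table 4 sequences named in the text
table4_counts = {
    5: (71, 229),
    6: (103, 118),
    9: (15, 270),
}

ratios = {rank: sec / pri for rank, (sec, pri) in table4_counts.items()}
decisions = {
    rank: "reject" if ratio > THRESHOLD else "accept"
    for rank, ratio in ratios.items()
}

for rank in sorted(ratios):
    print(f"sequence {rank}: ratio = {ratios[rank]:.3f} -> {decisions[rank]}")
```

The ratios are approximately 0.31 for sequence 5, 0.87 for sequence 6 and 0.06 for sequence 9, so only sequence 6 is rejected, while sequence 5 stands out as intermediate between a clear error and a typical genuine sequence.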
MAUI-seq counts UMIs, not individual reads, and the default setting is to require that the primary sequence has at least two more reads than the next most frequent sequence (if any) that has the same UMI. This enriches for genuine sequences, which are generally more abundant than errors, but it means, of course, that the number of counts is much lower than the number of reads. In fact, for this particular set of data, the number of UMIs is orders of magnitude smaller than either the raw reads or the DADA2 count, although still sufficient to provide good estimates of the relative abundance of the sequences that make up the bulk of the population. The main reason for the low UMI count is that the number of reads per UMI was suboptimal in these data for the rpoB gene: only 18% of the UMIs had more than one read, and MAUI-seq discards single-read UMIs by default. By contrast, in the equivalent data for the recA gene in the same study (Additional file 2 Field-Samples-2-phusion-recA), 37.5% of UMIs had more than one read, making more effective use of the available sequence reads.
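The proportion of UMIs represented by more than one read, which determines how many UMIs survive the default filter, can be checked with a few lines of Python. This is a sketch under the assumption that the UMI of every read is available as a list of strings; the function name `multi_read_fraction` is hypothetical and is not part of the MAUI scripts.

```python
from collections import Counter


def multi_read_fraction(read_umis):
    """Fraction of distinct UMIs that are supported by more than one read.

    read_umis: iterable with one UMI string per read (hypothetical input).
    """
    reads_per_umi = Counter(read_umis)
    if not reads_per_umi:
        return 0.0
    multi = sum(1 for n in reads_per_umi.values() if n > 1)
    return multi / len(reads_per_umi)


# Seven reads spread over four UMIs; two of the UMIs have more than one read.
toy_umis = ["u1", "u1", "u2", "u3", "u3", "u3", "u4"]
fraction = multi_read_fraction(toy_umis)
print(f"{fraction:.1%} of UMIs have more than one read")
```

A low value, such as the 18% reported for rpoB, indicates that most reads fall in single-read UMIs and are therefore not used, whereas a higher value, such as the 37.5% for recA, means that more of the sequencing effort contributes to the final counts.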