Identication and Removal of Potential Contaminants in 16S rRNA Gene Sequence Datasets from Low Microbial Biomass Samples: An Example from the Mosquito Gut

Background: The bacterial gut microbiota of the female mosquito inuences numerous physiological processes, including vector competence. As a low-microbial-biomass ecosystem, mosquito gut tissue is prone to contamination from the laboratory environment and from reagents commonly used to dissect and/or isolate DNA from gut tissue. In this report, we analyze ve 16S rRNA datasets, including new data obtained by us, to gain insight into the impact of potential contaminating sequences on the composition, diversity, and structure of the mosquito gut microbial community. Results: We present a clustering-free approach that, based on the relative abundance of amplicon sequence variants (ASVs) in gut and negative control samples , allowed for the identication of candidate contaminating sequences. Some of these sequences belong to bacterial taxa previously identied as common contaminants in metagenomic studies; they have also been identied as part of the mosquito core gut microbiota, with putative physiological relevance for the host. By using different relative abundance cutoffs, we show that contaminating sequences have a signicant impact on gut microbiota diversity and structure. Conclusions: The approach presented here allows the identication and removal of purported contaminating sequences in datasets obtained from low-microbial biomass samples. While it was exemplied with the analysis of gut microbiota from mosquitos, it can easily extend to other datasets dealing with similar technical artifacts. or 10% in control samples. Previous studies have employed this approach based on two assumptions: (i) that contaminating sequences have frequencies that correlate inversely with the DNA concentration of the samples; (ii) that contaminating sequences have a higher prevalence in control retrieved referred to gene


Background
Mosquitoes have a resident bacterial gut microbiota that is fundamental for several physiological processes such as larval growth, blood digestion, and immune function [1][2][3], making gut bacteria a prospective avenue for reducing vector competence [4]. For this reason, a multitude of studies have described microbial gut communities of mosquitoes from relevant medical genera (e.g. Aedes, Anopheles, and Culex), in different settings (labreared or eld-collected), and in different physiological contexts (e.g. blood-fed, virus-or parasite-infected) (reviewed in [1,2,5]). These studies have demonstrated the taxonomic diversity of bacteria present in mosquito gut tissues, even at the intraspeci c level. Most of the identi ed bacteria belong to the Proteobacteria phylum (e.g., Acinetobacter, Aeromonas, Asaia, Comamonas, Enterobacter, Klebsiella, Pantoea, Pseudomonas, Serratia, among others) [5].
Parallel to advances of the past decade in microbiota identi cation techniques that have allowed the description of mosquito gut microbiota in greater detail, interest has grown in understanding and describing the impact of contaminants introduced by molecular reagents, such as microbiota from DNA extraction kits (referred to as the "kitome" [6-8]), PCR master mix [7,9], laboratory facilities [8,10] and technical issues such as well-towell contamination [11] or index switching in sequencing platforms [12,13]. Interestingly, many contaminating sequences belong to bacteria commonly associated with mosquito gut tissues, including Acinetobacter, Chryseobacterium, Enterobacter, or Pseudomonas [6,8,14]. This raises a question about the identi cation of these bacteria in the mosquito gut: is it an actual presence or the signal from undesired contamination? To counter the effect of contamination, especially when studying low-biomass samples, where contaminants can become dominant in the sampling [14], researchers have proposed precautionary measures, such as randomizing sample types and treatment groups, decontaminating working areas, and sequencing of negative controls (e.g. sampling blanks and DNA extraction reagents) and positive controls (e.g. mock communities) [6, 8, [14][15][16].
Interest in standardizing the methodologies has grown for mosquito microbiota research [17,18], recognizing that mosquito tissues are low-biomass samples, expected to be prone to sequencing artifacts [17]. However, no consensus approach has been developed for the identi cation and removal of possible contaminating sequences from mosquito tissues to date. Published studies in mosquito gut microbiota that have reported the sequencing of negative control samples (Additional Table 1), only a few have addressed the quanti cation and reduction of putative contaminating sequences. One reported the complete removal of sequences detected in controls [19] and two others reported the removal of shared OTUs with relative abundances at least 10 times greater in control samples compared to tissue samples [20,21]. Unfortunately, the taxonomic identity of the removed OTUs in these studies was not reported.
In this report, we examine mosquito gut microbiota datasets from some previously published studies and add a newly generated dataset obtained by us that included several negative controls, in order to identify and remove purported contaminating sequences. We next quanti ed the impact of contaminating sequences on gut microbiota diversity and structure. Finally, we develop a simple strategy for the removal of contaminating sequences from mosquito gut tissue samples.

Results
We analyzed ve datasets (Table 1) using the same pipeline to remove low-quality sequences (i.e. sequences that did not align with the analyzed region, chimeric and/or non-bacterial sequences), subsequently de ning amplicon sequence variants (ASVs; sequences clustered at 100% identity, see Methods). For negative control samples, we determined the total number of ASVs and those with relative abundance thresholds of ≥ 1%, 5%, and 10% ( Table 2). We also quanti ed the proportion of ASVs shared between negative control and tissue samples. In four datasets, these shared ASVs represented more than 10% of sequences obtained in the tissue samples (66.84% in Aedes, 56.76% in Albopictus, 56.68% in Anopheles2 and 11.79% in Anopheles1), while in the Aegypti dataset, shared ASVs represented less than 1%. As expected, increasing the relative abundance threshold reduced the number of shared ASVs, as well as their presence in tissue samples. For the 10% threshold, the contamination abundance in tissue samples was 38.94% in Aedes, 18.68% in Anopheles2, 4.38% in Anopheles1, and 2.76% in Albopictus.
Next, we examined how removing of sequences found in control samples affected the composition and structure of tissue samples using the following treatments: complete removal of ASVs found in negative controls, and removal of ASVs with abundance thresholds ≥ 1%, 5%, or 10%.
Given the differences in targeted 16S rRNA fragments, DNA extraction methods, and the variety of negative controls utilized in each study (Table 1), alpha and beta diversities were not directly comparable across the datasets. Therefore, we compared changes in microbial diversity before and after ASV removal treatments in each dataset separately. For alpha diversity, we observed signi cant reductions in the number of OTUs for all datasets after total removal of the sequences found in control samples-the most drastic treatment, but also with the other removal treatments, except for the Aegypti dataset (Fig. 1A). The Shannon diversity index also showed statistically signi cant changes with the total removal treatment (except for Albopictus and Anopheles1 datasets) and with the threshold removal treatments (except for Aegypti) (Fig. 1B). For the evenness (Pielou's index), we observed changes in three datasets-Aedes, Anopheles1, and Anopheles2-the rst two with signi cant differences for all removal treatments ( Fig. 1C). For beta diversity, we used the Jaccard and Bray-Curtis indices. Like the alpha diversity analysis, statistically signi cant differences against the non-removal treatment were observed for all treatments (Fig. 2).

Discussion
With our survey of available mosquito gut datasets and a new one reported here, we highlight the impact of potential contaminants on the composition, structure, and diversity of low-microbial biomass samples. We used a clustering-free approach to make precise identi cation of potentially contaminating sequences. This strategy works well for this purpose because contaminants are likely very speci c, as has been previously demonstrated [22,23]. Our removal strategy recognizes their uniqueness and puts forward a way to trim them that is not dependent on reference taxonomic databases. Therefore, it can be implemented in different datasets dealing with DNA sequences and extended to metagenomic pipelines.
The use of a clustering-free approach allows separating potential contamination and actual sequences with the same taxonomy but having a different biological origin. In our study, we found that abundant ASVs detected in negative controls were classi ed within Acinetobacter, Chryseobacterium, Enterobacter, or Pseudomonas. These genera have previously been described as common contaminants [6][7][8]14]. However, other ASVs classi ed in these same genera were found in tissue samples and were not detected in negative controls. This illustrates the advantage of our clustering-free approach, as these bacterial groups have been reported as part of the core microbiota in Aedes and Anopheles mosquitoes [1,2,5], having putative functional roles in their host. Enterobacter has hemolytic activity associated with blood digestion and egg production [24]; Enterobacter and Pseudomonas reduce vector competence for Plasmodium infection [25,26] and LaCrosse virus [27]; Acinetobacter and Chrysobacterium may contribute to larval development [28].
Not only have we identi ed purported contaminating sequences, but we also evaluated trimming strategies to reduce their effects on data analysis.
Almost all removal strategies affected microbial inference, a major result that highlights the impact of ignoring contamination and the crucial role of negative controls to remove potential sources of noise. The rst strategy used, removing all the sequences found in negative controls, is considered a very conservative method, where it is preferable to pay the cost of eliminating the true positives than to keep contaminants in the nal dataset. This loss of biological data due to cross-contamination between experimental and the negative controls samples (e.g. well-to-well contamination [11] or index switching [12]) is a reason some authors discourage using this method, suggesting that removal of sequences present in negative controls should only be performed when it can be ensured that they correspond to actual contaminants [16] and propose the use of alternative methods [14] (see below).
Distinguishing laboratory or reagent contamination from cross-contamination with experimental samples is very challenging in low-microbial biomass samples, where the presence of ASVs with low abundance are ubiquitous. Our analysis showed that most ASVs found in negative controls had low abundance (≤ 1%), and some were identi ed as symbionts, suggesting potential cross-contamination from tissue samples to negative controls. For instance, the endosymbiont Wolbachia was found in control samples in the Aedes, Albopictus and Anopheles2 datasets in low abundance (< 1%). Thorsellia, another bacteria reported as a natural mosquito symbiont [29][30][31], was present in the control sample of the Anopheles2 dataset with an abundance of 1.54% (Table 3). As de Gouffau et at. [15] pointed out, ecological data should be considered when evaluating if these unexpected results make sense; in this instance, they do not.
Another approach tested here was the use of abundance thresholds for the removal of contaminating sequences. We observed that similar to the complete removal treatment, there were signi cant changes in alpha and beta diversities after removal of sequences with relative abundances ≥ 1%, 5%, or 10% in control samples. Previous studies have employed this approach based on two assumptions: (i) that contaminating sequences have frequencies that correlate inversely with the DNA concentration of the samples; (ii) that contaminating sequences have a higher prevalence in control samples than in experimental samples [32]. However, these assumptions are not valid in the analysis of samples with low-microbial biomass, where contaminating sequences can dominate the entire library. Some authors have employed this approach in the study of mosquito-associated microbiota. For instance, Minard et al. [20] removed all shared OTUs with relative abundances at least 10 times greater in control samples than in tissue samples. Instead, we showed that microbial inference is severely affected by the removal of contaminants with varying abundance thresholds. However, we do not consider the total removal of sequences found in negative controls or any prede ned abundance threshold as universal to determine contamination. Each study needs to de ne its proper criteria according to the data obtained from sequencing and the quality of the controls used.
Complementary to the strategy proposed here, it is necessary to establish additional measures to identify and reduce contamination in the analysis of low-microbial biomass samples [6, 14,16]. These procedures include: (1) maximizing the starting sample biomass by choice of sample type, ltration or enrichment, (2) randomization of samples and treatments to avoid batch/day effects, (3) recording batch numbers of reagents, (4) sequencing of many negative controls that cover all sample processing steps (i.e. dissection, DNA extraction, library preparation), (5) sequencing of positive controls (e.g. mock community, high-biomass samples with known composition) that can help to detect cross-contamination and (6) reporting negative control sequences in genomic repositories, along with tissue sample sequences.

Conclusion
Our analysis of mosquito gut microbiota datasets revealed the common presence of contaminant sequences that signi cantly affected the composition, diversity, and structure of the inferred microbial community. To minimize this impact, we proposed a clustering-free approach to precisely identify potential contaminants and evaluated different abundance thresholds to gauge the impact of high and low abundance ASVs on the inferred microbial community. This strategy should be complemented with laboratory protocols to identify and reduce sample contamination including as many controls as possible.

Data acquisition
We analyzed microbial DNA sequences from mosquito guts and control samples using two data sources: a newly developed dataset which we report here, and four datasets retrieved from previous studies. The new dataset of Aedes aegypti and Ae. albopictus gut samples (hereafter referred to as Aedes) was obtained from adults from lab colonies raised at the Max Planck Tandem Table 1) that matched our criteria. From this group, we selected ve datasets that included sequences from both gut samples and negative controls (blank and/or DNA extraction sample(s)). One dataset was discarded given their limited number of sequences in control samples [34]. The four remaining datasets used in our analysis were from Ae. aegypti (referred to as Aegypti) [21], Ae. albopictus (Albopictus) [20], An. gambiae/An. coluzzii (Anopheles1) [31] and An. darlingi/An. nuneztovari (Anopheles2) [35] (Table 1).

Microbiota analyses
For the ve datasets analyzed here, raw reads were processed following the SOP for MiSeq sequences of Mothur v. 1.43.0 [36]. Low-quality sequences were ltered out according to established parameters: (a) presence of ambiguous nucleotides, (b) sequences with more than 8 homopolymers, (c) sequence length lower than the 2.5% percentile, (d) sequence length higher than the 97.5% percentile. Remaining sequences were pre-clustered to reduce sequencing errors (allowing one difference every 100 bp), chimeras removed with VSEARCH [37], as well as non-bacterial sequences, based on a preliminary classi cation using the SILVA v132 database [38]. Singletons were removed from the nal dataset. For the Aedes and Albopictus datasets, data were normalized to 25,000 sequences per sample because of their high sequencing depth.
To evaluate the effect of removing contaminating sequences on microbial composition and diversity, we made the following analyses. First, we used a clustering-free approach to identify independent bacterial subpopulations shared by gut samples and negative controls. Then, each unique sequence (i.e. 100% nucleotide identity) was de ned as an amplicon sequence variant (ASV). Afterward, we used different relative abundance criteria for removing ASVs shared between negative controls and tissue samples. We obtained ve different subsets: (a) original dataset with no removal of ASVs found in negative controls, (b) removal of all ASVs present in control samples from tissue samples, (c) removal of ASVs with an overall relative abundance in control samples ≥ 1% (d) removal of ASVs with an overall relative abundance in control samples ≥ 5%, and (e) removal of ASVs with an overall relative abundance in control samples ≥ 10%. Finally, for each of the above subsets, we clustered sequences at 97% sequence identity to obtain standard OTUs using the OptiClust algorithm implemented in Mothur [39].
To evaluate the changes in microbiota composition, diversity, and structure in the subsets described above, we calculated alpha and beta diversity indices. Speci cally, we calculated the number of OTUs as a measure of microbial richness, Shannon diversity index, and Pielou's evenness index.
Also, we calculated the Jaccard and Bray-Curtis indices as measures of beta diversity. After evaluating the normality of the indices for each subset using a Shapiro-Wilk test, we used paired t-tests and Wilcoxon signed-rank tests to determine whether there was a statistically signi cant difference between alpha and beta diversity indices in the subsets where putative contaminants were removed compared to the original, unaltered datasets.