Using our survey of available mosquito gut datasets together with the new dataset reported here, we highlight the impact of potential contaminants on the composition, structure, and diversity of low-microbial biomass samples. We used a clustering-free approach to precisely identify potentially contaminating sequences. This strategy works well for this purpose because contaminants are likely to be very specific, as has been previously demonstrated [22, 23]. Our removal strategy recognizes their uniqueness and offers a way to trim them that does not depend on reference taxonomic databases. It can therefore be applied to other DNA sequence datasets and extended to metagenomic pipelines.
A clustering-free approach makes it possible to separate potential contaminants from genuine sequences that share the same taxonomy but have a different biological origin. In our study, we found that abundant ASVs detected in negative controls were classified within Acinetobacter, Chryseobacterium, Enterobacter, or Pseudomonas. These genera have previously been described as common contaminants [6–8, 14]. However, other ASVs classified in these same genera were found in tissue samples and were not detected in negative controls. This illustrates the advantage of our clustering-free approach, as these bacterial groups have been reported as part of the core microbiota in Aedes and Anopheles mosquitoes [1, 2, 5], with putative functional roles in their host. Enterobacter has hemolytic activity associated with blood digestion and egg production; Enterobacter and Pseudomonas reduce vector competence for Plasmodium infection [25, 26] and La Crosse virus; Acinetobacter and Chryseobacterium may contribute to larval development.
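The distinction drawn above can be illustrated with a minimal sketch, which is not the pipeline used in this study. It contrasts flagging contaminants by exact sequence identity with flagging them by genus. The ASV identifiers, the genus assignments, and both function names are hypothetical and chosen only for illustration.

```python
from typing import Dict, Set


def flag_by_sequence(tissue_asvs: Set[str], control_asvs: Set[str]) -> Set[str]:
    """Flag only the exact sequence variants that also occur in a negative control."""
    return tissue_asvs & control_asvs


def flag_by_genus(
    tissue_asvs: Set[str], control_asvs: Set[str], genus_of: Dict[str, str]
) -> Set[str]:
    """Flag every tissue ASV whose genus was seen in a negative control."""
    control_genera = {genus_of[asv] for asv in control_asvs}
    return {asv for asv in tissue_asvs if genus_of[asv] in control_genera}


# Hypothetical toy data: two distinct Pseudomonas ASVs, only one of them in the control.
genus_of = {"ASV1": "Pseudomonas", "ASV2": "Pseudomonas", "ASV3": "Enterobacter"}
control_asvs = {"ASV1"}
tissue_asvs = {"ASV1", "ASV2", "ASV3"}

by_sequence = flag_by_sequence(tissue_asvs, control_asvs)
by_genus = flag_by_genus(tissue_asvs, control_asvs, genus_of)
print(sorted(by_sequence), sorted(by_genus))
```

In this toy case the genus-level rule also discards ASV2, a Pseudomonas variant that never appeared in the control, whereas the sequence-level rule retains it.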
Not only have we identified purported contaminating sequences, but we also evaluated trimming strategies to reduce their effects on data analysis. Almost all removal strategies affected microbial inference, a major result that highlights the impact of ignoring contamination and the crucial role of negative controls in removing potential sources of noise. The first strategy used, removing all the sequences found in negative controls, is considered a very conservative method, in which losing some true positives is preferred over keeping contaminants in the final dataset. This loss of biological data, which can arise from cross-contamination between experimental samples and negative controls (e.g. well-to-well contamination or index switching), is one reason some authors discourage this method. They suggest that sequences present in negative controls should be removed only when it can be ensured that they correspond to actual contaminants, and they propose the use of alternative methods (see below).
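As a rough sketch of the complete-removal strategy discussed above, and not the code used in this study, the example below drops from every tissue sample any ASV observed at least once in a negative control. The count table, the sample names, and the function name `remove_control_asvs` are hypothetical.

```python
from typing import Dict, Set

CountTable = Dict[str, Dict[str, int]]  # sample id -> {ASV id -> read count}


def remove_control_asvs(counts: CountTable, control_ids: Set[str]) -> CountTable:
    """Drop every ASV seen in any negative control from all tissue samples."""
    flagged = {
        asv
        for cid in control_ids
        for asv, n in counts[cid].items()
        if n > 0
    }
    return {
        sid: {asv: n for asv, n in asvs.items() if asv not in flagged}
        for sid, asvs in counts.items()
        if sid not in control_ids
    }


# Hypothetical toy table: ASV4 stands for a symbiont that reached the control
# at very low abundance, for example through well-to-well contamination.
counts = {
    "neg_ctrl": {"ASV1": 950, "ASV4": 3},
    "midgut_A": {"ASV1": 40, "ASV2": 300, "ASV4": 600},
    "midgut_B": {"ASV2": 120, "ASV3": 80},
}
cleaned = remove_control_asvs(counts, {"neg_ctrl"})
print(cleaned)
```

Here ASV4 dominates midgut_A but is removed because three of its reads appear in the control, which is the true-positive cost of this conservative rule.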
Distinguishing laboratory or reagent contamination from cross-contamination with experimental samples is very challenging in low-microbial biomass samples, where ASVs with low abundance are ubiquitous. Our analysis showed that most ASVs found in negative controls had low abundance (≤ 1%), and some were identified as symbionts, suggesting potential cross-contamination from tissue samples to negative controls. For instance, the endosymbiont Wolbachia was found in control samples in the Aedes, Albopictus and Anopheles2 datasets in low abundance (< 1%). Thorsellia, another bacterium reported as a natural mosquito symbiont [29–31], was present in the control sample of the Anopheles2 dataset with an abundance of 1.54% (Table 3). As de Goffau et al. pointed out, ecological data should be considered when evaluating whether such unexpected results make sense; in this instance, they do not.
Another approach tested here was the use of abundance thresholds for the removal of contaminating sequences. We observed that, as with the complete removal treatment, alpha and beta diversities changed significantly after removal of sequences with relative abundances ≥ 1%, 5%, or 10% in control samples. Previous studies have employed this approach based on two assumptions: (i) that contaminating sequences have frequencies that correlate inversely with the DNA concentration of the samples; (ii) that contaminating sequences have a higher prevalence in control samples than in experimental samples. However, these assumptions are not valid in the analysis of samples with low-microbial biomass, where contaminating sequences can dominate the entire library. Some authors have employed this approach in the study of mosquito-associated microbiota. For instance, Minard et al. removed all shared OTUs with relative abundances at least 10 times greater in control samples than in tissue samples. Our results, in contrast, show that microbial inference is severely affected by the removal of contaminants under varying abundance thresholds. Accordingly, we do not consider either the total removal of sequences found in negative controls or any predefined abundance threshold to be a universal criterion for determining contamination. Each study needs to define its own criteria according to the data obtained from sequencing and the quality of the controls used.
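The two threshold rules described above can be sketched as follows. This is an illustrative approximation under stated assumptions, not the implementation used here or in the cited work: the read counts, the ASV names, and the function names are hypothetical, and the fold-change rule is only modelled on the 10-fold criterion mentioned in the text.

```python
from typing import Dict, Set

Counts = Dict[str, int]  # ASV id -> read count within one library


def relative_abundance(sample: Counts) -> Dict[str, float]:
    """Convert raw read counts to fractions of the library total."""
    total = sum(sample.values())
    return {asv: n / total for asv, n in sample.items()} if total else {}


def flag_by_threshold(control: Counts, threshold: float) -> Set[str]:
    """Flag ASVs whose relative abundance in the control is at least `threshold`."""
    return {asv for asv, frac in relative_abundance(control).items() if frac >= threshold}


def flag_by_fold(control: Counts, tissue: Counts, fold: float = 10.0) -> Set[str]:
    """Flag shared ASVs at least `fold` times more abundant in the control."""
    c, t = relative_abundance(control), relative_abundance(tissue)
    shared = {asv for asv in c.keys() & t.keys() if c[asv] > 0 and t[asv] > 0}
    return {asv for asv in shared if c[asv] >= fold * t[asv]}


# Hypothetical libraries of 1000 reads each.
control = {"ASV_a": 900, "ASV_b": 80, "ASV_c": 15, "ASV_d": 5}
tissue = {"ASV_a": 50, "ASV_b": 500, "ASV_e": 450}

flagged_1pct = flag_by_threshold(control, 0.01)
flagged_5pct = flag_by_threshold(control, 0.05)
flagged_10pct = flag_by_threshold(control, 0.10)
flagged_fold = flag_by_fold(control, tissue)
print(sorted(flagged_1pct), sorted(flagged_5pct), sorted(flagged_10pct), sorted(flagged_fold))
```

The set of flagged ASVs shrinks as the threshold rises, so the choice of cut-off directly determines which sequences remain available for diversity estimates.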
In addition to the strategy proposed here, further measures are needed to identify and reduce contamination in the analysis of low-microbial biomass samples [6, 14, 16]. These procedures include: (1) maximizing the starting sample biomass by choice of sample type, filtration or enrichment, (2) randomization of samples and treatments to avoid batch/day effects, (3) recording batch numbers of reagents, (4) sequencing of many negative controls that cover all sample processing steps (i.e. dissection, DNA extraction, library preparation), (5) sequencing of positive controls (e.g. mock community, high-biomass samples with known composition) that can help to detect cross-contamination, and (6) reporting negative control sequences in genomic repositories, along with tissue sample sequences.
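Measures (2) and (4) above lend themselves to a simple planning sketch: samples are shuffled before being assigned to processing batches, and each batch receives its own negative control. The sample labels, the control naming scheme, and the function `assign_batches` are assumptions made for illustration only.

```python
import random
from typing import Dict, List, Sequence


def assign_batches(
    sample_ids: Sequence[str], n_batches: int, seed: int = 0
) -> Dict[int, List[str]]:
    """Randomly distribute samples over batches and add one negative control per batch."""
    rng = random.Random(seed)  # fixed seed so the layout can be recorded and reproduced
    shuffled = list(sample_ids)
    rng.shuffle(shuffled)
    batches: Dict[int, List[str]] = {b: [] for b in range(n_batches)}
    for i, sid in enumerate(shuffled):
        batches[i % n_batches].append(sid)
    for b in batches:
        batches[b].append(f"negative_control_batch{b}")
    return batches


# Hypothetical sample labels.
samples = [f"midgut_{i}" for i in range(10)]
batches = assign_batches(samples, n_batches=3, seed=42)
for batch_id, members in batches.items():
    print(batch_id, members)
```

Recording the seed alongside reagent batch numbers would allow the same layout to be regenerated later when tracing a suspected batch effect.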