Summary of BMM analysis for Ag1000G
The pipeline efficiently computed genome posterior probabilities (i.e. the “output BMM”) for over 147 billion sequence reads from 1,142 individual specimens that comprise the Ag1000G Phase I and Phase 2 datasets (Table 1).[15, 16] These reads were compared against a database of >600,000 reference genomes (at time of analysis) spanning the entire tree of life. The pipeline considered vertebrates, plants, protozoans, chromists, and archaea references, in addition to bacteria, viruses, and arthropods, to estimate taxon abundance profiles per mosquito.[17] Less than one percent of all reads (0.861%) failed assignment (with edit distance 20 or better) to any sequence present in sequence databases at the time of analysis. In the following summary, we say “a read was assigned to a taxon” to mean that a given read had the highest probability of coming from a given taxon, even though the output BMM presents possible alternatives and their corresponding probabilities.
93.0% of reads were placed in the phylum Arthropoda,which includes Anopheles gambiae mosquitoes. In addition, we detected a considerable number of reads assigned to chordates, bacteria, and bacteriophage (Table 1). Specimens associated with chordates included reads assigned to hominid, bovid, canid, equid, and phasianid hosts (Table 2). These samples contained reads covering at least 25% of the chordate reference genomes. All specimens contained reads assigned to human sequences, however the number of reads varied widely between specimens suggesting that some of these represented blood meals taken by the mosquitoes prior to capture, while others were likely the result of specimen contamination. An exceedingly small proportion of reads were assigned to plants, fungi, and other taxa. Less than one percent of all reads (0.861%) failed alignment to any sequence present in sequence databases at the time of analysis.
Anopheles gambiae species complex mosquitoes
We evaluated BMM assignments of samples that were morphologically and genomically verified by the Ag1000G project, to members of the An. gambiae complex, which are evolutionarily similar. On average, the An. gambiae complex represented 93.3% of the probability mass given to Arthropod-assigned reads, with the remaining mass scattered around other anophelines. BMM provided easily interpretable and accurate estimates at this taxonomic level for all samples. Next, we evaluated BMM probabilities at the species level. Higher An. gambiae probabilities corresponded to An. gambiae samples (and similarly for An. coluzzii probabilities and samples), though probabilities were more evenly distributed across these taxa expressing greater uncertainty in the estimates (Fig. 1). Straightforward selection of clusters from BMM statistics correctly grouped the Ag1000G samples at the species level with 96.4% accuracy (Fig. S2). In summary, our model provided useful probabilities suggesting robust interpretation of noisy alignment data across references with varying size, quality, and homology. Uncertainty increased at the species level which encourages more careful inspection and interpretation of model probabilities. As genomic databases such as The Darwin Tree of Life Project, VectorBase and the i5K initiative and others expand, we expect the uncertainty of assignments to accordingly decrease.[18-20]
Vertebrate Hosts
Our model assigned vertebrate reads to human as well as several domesticated animals including cow, goat, dog, donkey, and fowl, which all comprise common livestock species found in rural settings across Africa (Table 2).[21] Although some human reads may be attributed to blood meals acquired from humans, it is difficult to discern human contamination from field and laboratory handling of individual specimens versus a human-derived blood meal. In contrast, the signal assigned to other vertebrates are likely to represent blood meals from host-fed mosquitoes. These signals represent over 613 million reads, sharing greater than 99% identity with reference genomes clearly indicating blood feeding by these mosquitoes had occurred on singular and in some cases, multiple hosts. 162 specimens had at least 25% coverage of a Chordata reference genome. Of these, 24 specimens were non-human hosts, and 138 specimens were human hosts (Table 3). In several cases the pipeline assigned reads to hosts based on very high coverage rates. For example, sample ERS224451 contained 125,897,307 assigned reads covering 84% of the Capra aegagrus (wild goat) genome.[22] Sample ERS224085 contained 49,266,543 assigned reads covering 84% of the Equus asinus asinus (donkey/ass) genome and sample ERS224472 contained 71,332,843 assigned reads covering over 69% of the human genome.[23, 24]
Given the high degree of anthropophily generally observed in An. gambiae and An. coluzzii, the abundance of non-human host signals is greater than expected (Fig. 2).[25, 26] This finding suggests that An. gambiae and An. coluzzii may be more opportunistic feeders than has previously been appreciated.[25, 27, 28] This may also reflect the relative abundance and diversity of hosts available to host-seeking mosquitoes in the sites where specimens were collected as well as the method of collection and handling of specimens prior to sequencing. All these factors must be considered when interpreting the results and further indicate the need for accurate and extensive metadata at the time of collection.[29]
Plasmodium
The BMM pipeline assigned 5,344,273 reads (0.0036%) to seven Plasmodium parasite species (with varying probabilities) distributed among 485 of the 1142 mosquitoes. However, the number of reads per sample varied widely. To further examine the presence of Plasmodium falciparum, the most lethal and primary human malaria Plasmodium species transmitted by An. gambiae, sequence reads from each sample were realigned using SNAP aligner against a single P. falciparum reference (GCA_001861075.1).[30] As the P. falciparum nuclear genome consists of low complexity sequences (80.6% A+T) which can result in ambiguity in sequence assignments and likely was the reason for assignment to seven parasite species, we assessed the coverage of the P. falciparum core genome (hypervariable and subtelomeric regions excluded, 20.8 Mb; relative to the apicoplast genome, 35 Kb).[31, 32] The P. falciparum apicoplast genome has a higher copy number compared to its nuclear counterpart (15:1 ratio), therefore it is expected to have an increased sensitivity of detection.[33, 34] 432 and 148 samples of the 1,142 mosquitoes contained sequence reads mapping to the P. falciparum core genome and the apicoplast, respectively (Fig. 3a). As expected, all specimens with apicoplast reads had a greater depth and a higher percent coverage than core genome assemblies, reflecting the disparity in sizes and copy numbers between the two genomes. We discovered that some of these suspected Plasmodium reads originated from six anopheline specimens morphologically identified as male mosquitoes. Whereas only female mosquitoes feed on blood, the assignment of P. falciparum reads to male mosquitoes should be considered erroneous due to contamination or mislabeling. All samples with reads aligned to P. falciparum apicoplast also contained reads aligned to the core genome, however the reverse was not true. Read validation was further accomplished by establishing a threshold of the coverage for the apicoplast and core genomes in relation to the P. falciparum characteristic guanine-cytosine (GC) genome content. We determined that the GC content of P. falciparum consensus sequences was distinctly lower from contigs found in the Ag1000G An. gambiae metagenomes and specifically when compared to bacterial taxa (Fig. 3b). Furthermore, the correlation between the GC content and the percent of genomes coverage denoted a distinct threshold in genome coverage above which the sequences had consistent GC content and within the estimated interquartile range (Fig.3b). Specifically, for the apicoplast and the core genomes the thresholds were estimated to be 3.0% (Fig. 3c) and 0.4% (Fig. 3d), respectively. Based on these thresholds, a total of 59 samples (5.6%) had validated coverage for both parasite genomes and were considered true positives for P. falciparum, while 339 samples had no validated coverage and are likely false positives (Table S1). All male mosquitoes were in the latter category.
Viruses
The BMM pipeline assigned 2,039,560 reads (0.001%) to 80 species of viruses and bacteriophage distributed among 223 of the 1142 mosquitoes. Eukaryotic viral sequences were found in 65 samples, bacteriophage-related sequences were present in 167 samples, while both eukaryotic viruses and bacteriophage were detected in 10 samples (Tables S2 & S3). Analysis revealed that many of these detections were false positives due to physical contamination or computational misassignment. Close examination of viral coverage maps, and sequencing flow cell history suggested that the human immunodeficiency virus (HIV) and influenza sequences detected were the consequence of contamination, most likely stemming from previous sequencing runs using the same flow cell. No known mosquito viruses were detected in any of the samples. However, nineteen mosquitoes contained authenticated vertebrate viral sequences, including fifteen specimens containing human hepatitis B virus (HBV), a single specimen containing ungulate erythroparvovirus-1, and three specimens containing primate erythroparvovirus-1 (Figs S3-S5). These viruses are not known to replicate in mosquitoes; therefore, HBV reads detected were most likely present in the blood meal. In support of this notion, all fifteen HBV positive samples, as well as the three samples harboring primate erythroparvovirus-1, contained human DNA while the sole sample in which ungulate erythroparvovirus-1 was detected also contained bovine DNA.
To further examine viral sequences detected by the BMM pipeline, all reads from each sample were assembled and examined by Integrator, an extension of the Premonition pipeline for virus and microbe detection that probes amino acid similarities (Tables S4 & S5).[35] Integrator confirmed the presence of HBV in twelve samples, as well as the presence of ungulate erythroparvovirus-1 in a single sample (Figs. 4, S1, S2, Table S6) and plots were generated in Circos.[36] Furthermore, the assembled contigs contained ORF structures consistent with HBV and ungulate parvovirus presence. Some samples contained near complete genomes of HBV and ungulate erythroparvovirus-1 (Fig. 5). Integrator also uncovered numerous previously unidentified bacteriophage (Table S7). In addition, Integrator found sequences with distant similarities to known mosquito viruses such as Anopheles annulipes orbivirus and Wuhan insect virus 23. However, these are RNA genome viruses and may be present as integrated partial viral genomes.
Bacteria
We analyzed bacterial taxa present in the Ag1000G Phase 1 and 2 data sets. The BMM analysis assigned approximately 0.6% of all sequence reads to bacteria (Table 1). Reads associated with bacteria were present in all 1142 samples, however, the number of bacterial reads per sample varied widely (Fig. 6a). Bacterial sequences may originate from microorganisms associated with the living mosquito specimens, microorganisms that grow postmortem on a preserved specimen, or due to contamination during nucleic acid preparation and sequencing. Therefore, we also examined the data following removal of taxa commonly associated with contamination.[37] Furthermore, we only analyzed samples with fewer than one million total bacterial reads and families with at least five thousand reads. This arbitrary cutoff was selected based on when 1) the number of bacterial reads flattened out, and 2) examination of low read samples revealed suspected contaminants. This reduced the number of samples containing bacterial sequences to 478 of 1,142. These reads were distributed among 59 bacteria families (Fig. 6b). The presence of bacteria in Ag1000G mosquitoes was confirmed by analysis with Integrator. Bacterial contigs were only found in samples that BMM identified as harboring bacterial reads. The bacterial phyla and genera identified by Integrator were similar to those detected by BMM (Table S2). As proof of concept, we examined two bacterial species, Elizabethkingia anophelis and Thorsellia anophelis in detail since both have been associated with the Anopheles microbiome.[38, 39] Elizabethkingia anophelis was detected in 35 specimens and Thorsellia anophelis in 42 of the 1,142 specimens (Figs. 6c & 6d). Greater than 95% coverage of these bacterial genomes was achieved in a subset of the samples. These bacteria accounted for most of the bacterial species detected in some samples with no samples having signatures of both bacterial taxa. In addition to bacteria, BMM/Integrator analysis also identified bacteriophage contig sequences. Most bacteriophages found in nature are novel species distantly related to sequences in databases and therefore, most of these novel species will be missed by BMM because of alignment requirements. This is consistent with the results of Integrator analysis that identified multiple contigs encoding bacteriophage-related proteins. Our approach does not distinguish between bacteriophage sequences present because of an ongoing infection from those that are integrated in bacterial genomes. In total, bacteriophage-related contigs were detected in 27 samples, all of which also contained a high number of bacterial reads (Fig. 6a). In three of the 27 samples, Integrator detected and assembled bacteriophage contigs despite not having any BMM-assigned bacteriophage reads.