A metagenomic analysis of the phase 2 Anopheles gambiae 1000 genomes dataset reveals a wide diversity of cobionts associated with field collected mosquitoes

doi:10.21203/rs.3.rs-2667362/v1

This work is licensed under a CC BY 4.0 License

Version 1

The Anopheles gambiae 1000 Genomes (Ag1000G) Consortium utilized deep sequencing methods to catalogue genetic diversity across African Anopheles gambiae populations. We analyzed the complete datasets of 1,142 individually sequenced mosquitoes through Microsoft Premonition’s Bayesian mixture model-based (BMM) metagenomics pipeline. All samples were confirmed as either An. gambiae sensu stricto (s.s.) or An. coluzzii with a high degree of confidence (>98% identity to reference). Homo sapiens DNA was identified in all specimens, indicating that contamination may have occurred at the time of sample collection, preparation and/or sequencing. We found evidence of vertebrate hosts in 162 specimens. 59 specimens contained validated Plasmodium falciparum reads. Human hepatitis B and primate erythroparvovirus-1 viral sequences, neither of which is mosquito-borne, were identified in fifteen and three specimens, respectively. Bacterial reads were found in 478 of the 1,142 specimens, and bacteriophage-related contigs were detected in 27 samples. This analysis demonstrates the capacity of metagenomic approaches to elucidate important vector-host-pathogen interactions of epidemiological significance.

Biological sciences/Microbiology/Pathogens

Biological sciences/Ecology/Ecological epidemiology

Biological sciences/Microbiology/Virology/Metagenomics

Biological sciences/Microbiology/Bacteria/Metagenomics

Biological sciences/Microbiology/Parasitology/Parasite genomics

The goal of the Anopheles gambiae 1000 Genomes (Ag1000G) Project was to determine the genetic diversity and population structure of An. gambiae complex mosquitoes, the primary vectors of human malaria parasites throughout sub-Saharan Africa.[1, 2] Whole genome sequencing (WGS) of individual field-collected mosquitoes from thirteen African countries was used to identify the presence and distribution of single nucleotide polymorphisms (SNPs) conferring phenotypic traits such as reduced susceptibility to insecticides.[1] In addition to mosquito DNA, endogenous and acquired viral, bacterial, fungal, protozoan, and in cases where mosquitoes had blood-fed, vertebrate DNA sequences may also be present in data generated from these field-derived specimens.

Anopheles gambiae mosquitoes interact with a myriad of vertebrates, plants, and their associated microbiota at various stages of development and across a diverse range of habitats.[3] Shifts in An. gambiae gut microbiome composition are strongly correlated with transitions from aquatic larval habitats to terrestrial settings where adult mosquitoes are actively seeking nectar and vertebrate host blood to support flight and egg production.[4] Partially digested blood meals often contain intact and degraded RNA and/or DNA derived from vertebrate host(s) as well as pathogens and microorganisms present in the circulatory system of the host at the time of feeding.[5] Methods such as mitochondrial DNA barcoding, amplicon-based and shotgun metagenomic sequencing, and ELISA-based approaches have been successfully used to elucidate host feeding patterns of hematophagous insects.[6] Compared to amplicon sequencing approaches, including 16S rDNA methods, shotgun metagenomic sequencing offers greater resolution, more accurate identification of microbial genera, and reduced primer bias.[7, 8] Metagenomic analyses of hematophagous insects and their blood meals represent a promising approach for rapid detection and identification of novel and established microbiota including pathogens and their respective vertebrate hosts.[9-12]

Recent advances in high-throughput deep sequencing and bioinformatics have given rise to metagenomic approaches for rapid, highly accurate resolution of complex environmental samples.[13, 14] Microsoft Premonition has developed a Bayesian mixture model-based (BMM) metagenomics pipeline capable of identifying known taxa at the species level and estimating novel species present in a single specimen (Fig. S1). The pipeline utilizes a 10-terabase genomic reference database and cloud-scale statistical machine learning to quickly: (1) build probabilistic assignments from reads to species based on sequence similarity, (2) refine species probabilities for ambiguous reads by computing a global statistical model across all reads, and (3) identify novel, unexpected, and contaminant genetic material by aligning against all taxa with available (partial) genomic references, i.e. without a priori assumptions on which taxa might be present in a sample and without limiting the analysis to a small subset of genomic references (e.g. to only pathogens for computational reasons). To this end, we analyzed the publicly available Ag1000G Phase 1 and 2 datasets using the Microsoft Premonition metagenomics pipeline to determine the constituent species present in each of 1,142 field-collected Anopheles gambiae mosquito samples.

Summary of BMM analysis for Ag1000G

The pipeline efficiently computed genome posterior probabilities (i.e. the “output BMM”) for over 147 billion sequence reads from 1,142 individual specimens that comprise the Ag1000G Phase 1 and Phase 2 datasets (Table 1).[15, 16] These reads were compared against a database of >600,000 reference genomes (at time of analysis) spanning the entire tree of life. The pipeline considered vertebrate, plant, protozoan, chromist, and archaeal references, in addition to bacteria, viruses, and arthropods, to estimate taxon abundance profiles per mosquito.[17] Less than one percent of all reads (0.861%) failed assignment (with edit distance 20 or better) to any sequence present in sequence databases at the time of analysis. In the following summary, we say “a read was assigned to a taxon” to mean that a given read had the highest probability of coming from a given taxon, even though the output BMM presents possible alternatives and their corresponding probabilities.
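The convention “a read was assigned to a taxon” can be illustrated with a minimal Python sketch. It is not the Premonition implementation; the function name `assign_read` and the example probabilities are hypothetical, and the only rule taken from the text is that the taxon with the highest probability is reported.

```python
from typing import Dict


def assign_read(posteriors: Dict[str, float]) -> str:
    """Return the taxon with the highest posterior probability for one read.

    `posteriors` maps taxon name -> probability for a single read
    (a hypothetical data layout chosen for this illustration).
    """
    if not posteriors:
        raise ValueError("read has no candidate taxa")
    return max(posteriors, key=posteriors.get)


# Hypothetical posterior for one read; the alternative taxon keeps its
# probability in the model even though only the top taxon is reported.
example_read = {"Anopheles gambiae": 0.7, "Anopheles coluzzii": 0.3}
assigned = assign_read(example_read)
```

The alternatives are not discarded by this step; they remain available in the full probability table for closer inspection.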

93.0% of reads were placed in the phylum Arthropoda, which includes Anopheles gambiae mosquitoes. In addition, we detected a considerable number of reads assigned to chordates, bacteria, and bacteriophage (Table 1). Specimens associated with chordates included reads assigned to hominid, bovid, canid, equid, and phasianid hosts (Table 2). These samples contained reads covering at least 25% of the chordate reference genomes. All specimens contained reads assigned to human sequences; however, the number of reads varied widely between specimens, suggesting that some of these represented blood meals taken by the mosquitoes prior to capture, while others were likely the result of specimen contamination. An exceedingly small proportion of reads were assigned to plants, fungi, and other taxa.

Anopheles gambiae species complex mosquitoes

We evaluated BMM assignments of samples that were morphologically and genomically verified by the Ag1000G project, to members of the An. gambiae complex, which are evolutionarily similar. On average, the An. gambiae complex represented 93.3% of the probability mass given to Arthropod-assigned reads, with the remaining mass scattered among other anophelines. BMM provided easily interpretable and accurate estimates at this taxonomic level for all samples. Next, we evaluated BMM probabilities at the species level. Higher An. gambiae probabilities corresponded to An. gambiae samples (and similarly for An. coluzzii probabilities and samples), though probabilities were more evenly distributed across these taxa, expressing greater uncertainty in the estimates (Fig. 1). Straightforward selection of clusters from BMM statistics correctly grouped the Ag1000G samples at the species level with 96.4% accuracy (Fig. S2). In summary, our model provided useful probabilities, suggesting robust interpretation of noisy alignment data across references with varying size, quality, and homology. Uncertainty increased at the species level, which encourages more careful inspection and interpretation of model probabilities. As genomic databases such as The Darwin Tree of Life Project, VectorBase, the i5K initiative and others expand, we expect the uncertainty of assignments to decrease accordingly.[18-20]

Vertebrate Hosts

Our model assigned vertebrate reads to human as well as several domesticated animals including cow, goat, dog, donkey, and fowl, all of which are common livestock species found in rural settings across Africa (Table 2).[21] Although some human reads may be attributed to blood meals acquired from humans, it is difficult to distinguish human contamination from field and laboratory handling of individual specimens from a human-derived blood meal. In contrast, the signals assigned to other vertebrates are likely to represent blood meals from host-fed mosquitoes. These signals represent over 613 million reads sharing greater than 99% identity with reference genomes, clearly indicating that these mosquitoes had blood-fed on single and, in some cases, multiple hosts. 162 specimens had at least 25% coverage of a Chordata reference genome. Of these, 24 specimens contained reads from non-human hosts, and 138 specimens contained reads from human hosts (Table 3). In several cases the pipeline assigned reads to hosts based on very high coverage rates. For example, sample ERS224451 contained 125,897,307 assigned reads covering 84% of the Capra aegagrus (wild goat) genome.[22] Sample ERS224085 contained 49,266,543 assigned reads covering 84% of the Equus asinus asinus (donkey/ass) genome and sample ERS224472 contained 71,332,843 assigned reads covering over 69% of the human genome.[23, 24]
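The 25% coverage criterion can be illustrated with a minimal Python sketch. It assumes that coverage means breadth of coverage (the fraction of reference positions covered by at least one aligned read), which the text does not spell out; the interval representation and the names `merge_intervals`, `breadth_of_coverage` and `has_host_signal` are hypothetical.

```python
from typing import Iterable, List, Tuple

# Half-open alignment interval [start, end) on a (concatenated) reference.
Interval = Tuple[int, int]


def merge_intervals(intervals: Iterable[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals so each base is counted once."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def breadth_of_coverage(intervals: Iterable[Interval], genome_length: int) -> float:
    """Fraction of the reference covered by at least one aligned read."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    covered = sum(end - start for start, end in merge_intervals(intervals))
    return covered / genome_length


def has_host_signal(intervals: Iterable[Interval], genome_length: int,
                    min_coverage: float = 0.25) -> bool:
    """Apply the 25% coverage cutoff quoted in the text (assumed inclusive)."""
    return breadth_of_coverage(intervals, genome_length) >= min_coverage


# Toy example: three alignments on a 1,000 bp reference, two of them overlapping.
alignments = [(0, 100), (50, 150), (400, 500)]
coverage = breadth_of_coverage(alignments, 1000)
```

Merging intervals before summing avoids counting overlapping reads twice, which matters when a host genome is sequenced to high depth.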

Given the high degree of anthropophily generally observed in An. gambiae and An. coluzzii, the abundance of non-human host signals is greater than expected (Fig. 2).[25, 26] This finding suggests that An. gambiae and An. coluzzii may be more opportunistic feeders than has previously been appreciated.[25, 27, 28] This may also reflect the relative abundance and diversity of hosts available to host-seeking mosquitoes in the sites where specimens were collected as well as the method of collection and handling of specimens prior to sequencing. All these factors must be considered when interpreting the results and further indicate the need for accurate and extensive metadata at the time of collection.[29]

Plasmodium

The BMM pipeline assigned 5,344,273 reads (0.0036%) to seven Plasmodium parasite species (with varying probabilities) distributed among 485 of the 1,142 mosquitoes. However, the number of reads per sample varied widely. To further examine the presence of Plasmodium falciparum, the most lethal and primary human malaria Plasmodium species transmitted by An. gambiae, sequence reads from each sample were realigned using the SNAP aligner against a single P. falciparum reference (GCA_001861075.1).[30] The P. falciparum nuclear genome consists largely of low-complexity sequence (80.6% A+T), which can result in ambiguous sequence assignments and was likely the reason for assignment to seven parasite species. We therefore assessed the coverage of the P. falciparum core genome (hypervariable and subtelomeric regions excluded, 20.8 Mb) relative to the apicoplast genome (35 Kb).[31, 32] The P. falciparum apicoplast genome has a higher copy number than its nuclear counterpart (15:1 ratio) and is therefore expected to be detected with greater sensitivity.[33, 34] Of the 1,142 mosquitoes, 432 and 148 samples contained sequence reads mapping to the P. falciparum core genome and the apicoplast, respectively (Fig. 3a). As expected, all specimens with apicoplast reads had a greater depth and a higher percent coverage than core genome assemblies, reflecting the disparity in sizes and copy numbers between the two genomes. We discovered that some of these suspected Plasmodium reads originated from six anopheline specimens morphologically identified as male mosquitoes. Because only female mosquitoes feed on blood, the assignment of P. falciparum reads to male mosquitoes should be considered erroneous, due to contamination or mislabeling. All samples with reads aligned to the P. falciparum apicoplast also contained reads aligned to the core genome; however, the reverse was not true.
Read validation was further accomplished by establishing a coverage threshold for the apicoplast and core genomes in relation to the characteristic guanine-cytosine (GC) content of the P. falciparum genome. We determined that the GC content of P. falciparum consensus sequences was distinctly lower than that of contigs found in the Ag1000G An. gambiae metagenomes, particularly when compared to bacterial taxa (Fig. 3b). Furthermore, the correlation between GC content and percent genome coverage revealed a distinct threshold in genome coverage above which the sequences had consistent GC content within the estimated interquartile range (Fig. 3b). Specifically, for the apicoplast and the core genomes the thresholds were estimated to be 3.0% (Fig. 3c) and 0.4% (Fig. 3d), respectively. Based on these thresholds, a total of 59 samples (5.6%) had validated coverage for both parasite genomes and were considered true positives for P. falciparum, while 339 samples had no validated coverage and are likely false positives (Table S1). All male mosquitoes were in the latter category.
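The two-genome validation rule can be sketched in Python as follows. The thresholds (3.0% apicoplast, 0.4% core) are taken from the text; treating them as inclusive, the helper names `gc_content` and `is_validated`, and the toy values are assumptions made for illustration only.

```python
# Coverage thresholds in percent, as quoted in the text.
APICOPLAST_MIN_COVERAGE_PCT = 3.0
CORE_MIN_COVERAGE_PCT = 0.4


def gc_content(sequence: str) -> float:
    """Fraction of G and C among unambiguous bases (A, C, G, T)."""
    seq = sequence.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return (seq.count("G") + seq.count("C")) / acgt


def is_validated(apicoplast_cov_pct: float, core_cov_pct: float) -> bool:
    """A sample counts as P. falciparum positive only if BOTH genomes pass.

    Whether the cutoffs are inclusive is not stated in the text; >= is assumed.
    """
    return (apicoplast_cov_pct >= APICOPLAST_MIN_COVERAGE_PCT
            and core_cov_pct >= CORE_MIN_COVERAGE_PCT)


# Toy values: an AT-rich consensus fragment and three hypothetical samples.
fragment_gc = gc_content("AATTAATTGC")
calls = [is_validated(5.0, 1.0), is_validated(5.0, 0.1), is_validated(1.0, 2.0)]
```

Requiring both genomes to pass makes the call conservative: a sample with reads on only one of the two references is left in the unvalidated group.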

Viruses

The BMM pipeline assigned 2,039,560 reads (0.001%) to 80 species of viruses and bacteriophage distributed among 223 of the 1,142 mosquitoes. Eukaryotic viral sequences were found in 65 samples, bacteriophage-related sequences were present in 167 samples, and both eukaryotic viruses and bacteriophage were detected in 10 samples (Tables S2 & S3). Analysis revealed that many of these detections were false positives due to physical contamination or computational misassignment. Close examination of viral coverage maps and sequencing flow cell history suggested that the human immunodeficiency virus (HIV) and influenza sequences detected were the consequence of contamination, most likely stemming from previous sequencing runs using the same flow cell. No known mosquito viruses were detected in any of the samples. However, nineteen mosquitoes contained authenticated vertebrate viral sequences, including fifteen specimens containing human hepatitis B virus (HBV), a single specimen containing ungulate erythroparvovirus-1, and three specimens containing primate erythroparvovirus-1 (Figs S3-S5). These viruses are not known to replicate in mosquitoes; therefore, the viral reads detected were most likely present in the blood meal. In support of this notion, all fifteen HBV-positive samples, as well as the three samples harboring primate erythroparvovirus-1, contained human DNA, while the sole sample in which ungulate erythroparvovirus-1 was detected also contained bovine DNA.

To further examine viral sequences detected by the BMM pipeline, all reads from each sample were assembled and examined by Integrator, an extension of the Premonition pipeline for virus and microbe detection that probes amino acid similarities (Tables S4 & S5).[35] Integrator confirmed the presence of HBV in twelve samples, as well as the presence of ungulate erythroparvovirus-1 in a single sample (Figs. 4, S1, S2, Table S6); plots were generated in Circos.[36] Furthermore, the assembled contigs contained ORF structures consistent with the presence of HBV and ungulate erythroparvovirus-1. Some samples contained near-complete genomes of HBV and ungulate erythroparvovirus-1 (Fig. 5). Integrator also uncovered numerous previously unidentified bacteriophage (Table S7). In addition, Integrator found sequences with distant similarities to known mosquito viruses such as Anopheles annulipes orbivirus and Wuhan insect virus 23. However, these viruses have RNA genomes, so the sequences detected in this DNA dataset may represent partial viral genomes integrated into the mosquito genome.

Bacteria

We analyzed bacterial taxa present in the Ag1000G Phase 1 and 2 datasets. The BMM analysis assigned approximately 0.6% of all sequence reads to bacteria (Table 1). Reads associated with bacteria were present in all 1,142 samples; however, the number of bacterial reads per sample varied widely (Fig. 6a). Bacterial sequences may originate from microorganisms associated with the living mosquito specimens, from microorganisms that grow postmortem on a preserved specimen, or from contamination during nucleic acid preparation and sequencing. Therefore, we also examined the data following removal of taxa commonly associated with contamination.[37] Furthermore, we only analyzed samples with fewer than one million total bacterial reads and families with at least five thousand reads. This arbitrary cutoff was selected based on 1) the point at which the number of bacterial reads flattened out, and 2) examination of low-read samples, which revealed suspected contaminants. This reduced the number of samples containing bacterial sequences to 478 of 1,142. These reads were distributed among 59 bacterial families (Fig. 6b). The presence of bacteria in Ag1000G mosquitoes was confirmed by analysis with Integrator. Bacterial contigs were only found in samples that BMM identified as harboring bacterial reads. The bacterial phyla and genera identified by Integrator were similar to those detected by BMM (Table S2). As proof of concept, we examined two bacterial species, Elizabethkingia anophelis and Thorsellia anophelis, in detail since both have been associated with the Anopheles microbiome.[38, 39] Elizabethkingia anophelis was detected in 35 specimens and Thorsellia anophelis in 42 of the 1,142 specimens (Figs. 6c & 6d). Greater than 95% coverage of these bacterial genomes was achieved in a subset of the samples. These bacteria accounted for most of the bacterial species detected in some samples, with no samples having signatures of both bacterial taxa.
In addition to bacteria, BMM/Integrator analysis also identified bacteriophage contig sequences. Most bacteriophages found in nature are novel species distantly related to sequences in databases and therefore, most of these novel species will be missed by BMM because of alignment requirements. This is consistent with the results of Integrator analysis that identified multiple contigs encoding bacteriophage-related proteins. Our approach does not distinguish between bacteriophage sequences present because of an ongoing infection from those that are integrated in bacterial genomes. In total, bacteriophage-related contigs were detected in 27 samples, all of which also contained a high number of bacterial reads (Fig. 6a). In three of the 27 samples, Integrator detected and assembled bacteriophage contigs despite not having any BMM-assigned bacteriophage reads.

As projects like Ag1000G continue to expand the volume of genomic data available for the An. gambiae species complex across its range of distribution, we expect our mixture model to become increasingly accurate at resolving species-level assignments. Our Bayesian mixture model assigned vertebrate reads to human as well as several domesticated animals including goat, cow, dog, and donkey hosts. In addition, evidence of mixed blood meals derived from two host species was detected in several specimens. We were able to ascertain reads assigned to P. falciparum in several vector samples. The relatively low coverage of the 23 Mb Plasmodium parasite genome demonstrated the challenge of detecting a small amount of parasite DNA against a large background of host genome and mixed DNA templates. However, using detection of both Plasmodium core and apicoplast genomes proved to be a novel method to validate parasite presence. Yet, because whole mosquitoes were typically used for DNA extractions and sequencing, it is not possible to discriminate whether mosquitoes were infectious.

We were unable to ascertain whether the widespread presence of reads assigned to Homo sapiens in nearly all specimens was a consequence of human feeding, of contamination during field collection, sorting and identification, laboratory manipulation and nucleic acid extraction, or of residual contamination of NGS flow cells between sequencing runs. Samples originated from many different collectors and were handled and extracted using multiple approaches. Some samples were stored in ethanol, while others were desiccated; some were extracted soon after collection and others after extended storage. Thus, bacteria may be present as part of the mosquito microbiome, phoretic on the external surface of the insects, as contaminants introduced during collection or sample processing, or as a result of microbial growth during specimen storage. We therefore term these mosquito-associated bacteria. The enormous number of bacterial reads present in some samples suggests that bacteria were actively growing in some samples.

In this data set, the microbiome of individual mosquitoes is relatively simple. Since bacteriophage cannot grow in insect cells, bacteriophage sequences should only be present in samples containing bacteria; however, we cannot distinguish bacteriophage infection from integrated phage genomes. All the bacteriophage detected were novel. The high degree of sensitivity of the NGS method underscores the need to preserve specimen integrity and standardize approaches from collection through analysis to accurately determine sequence identity and the nature of biological associations. Additionally, the results demonstrate the importance of targeting collections and metadata to address specific questions. We continue to investigate unusual genomic assignments for systematic contamination of reference databases and are developing disciplined methods to address reference contamination. This study shows that metagenomic analysis of mosquitoes provides a robust strategy for detecting and monitoring the host species from which mosquitoes obtain a blood meal, as well as protozoa, bacteria and viruses that are circulating among vertebrate hosts.

Introduction To Metagenomics Bayesian Mixture Model

Briefly, the Microsoft Premonition pipeline takes as input: (1) a sample \(S=\{r_{1},\dots,r_{m}\}\), which is a set of sequencer reads, and (2) a reference genome database \(Ref=\{g_{1},\dots,g_{n}\}\), which is a set of genomes. It computes the probability distribution \(p(r, g \mid S, Ref)\), which is the probability that read \(r\) in the sample \(S\) came from genome \(g\) in the reference database \(Ref\). This distribution is computed without assumptions on the species that may be present in the sample (and so every read is aligned to every genome in the reference). This is well-suited for environmental samples that have few biological constraints on the species that might be in a sample and that may contain genomic fragments from many species with low genome coverage.

The uncertainty in the resulting probability distribution can indicate, among other phenomena: (1) uncertainty of species due to sequence similarity, (2) the presence of novel species where reads are unlikely to have come from any genomes in the available references, and (3) genome coverage patterns that are consistent with non-biological artifacts. Though the full description of the pipeline is outside the scope of this paper, the resulting probability distribution is a Bayesian mixture model (BMM), which globally optimizes the probability of each read coming from a genome based on statistical evidence gathered from other reads in the sample. In the context of the Ag1000G project, this allows the pipeline to suppress the probability that low complexity or highly conserved anopheline reads might have come from other anopheline species based on the overwhelming evidence for An. gambiae s.s. or An. coluzzii coming from other unambiguous reads in the sample. At the same time, some small probability can be assigned to these less likely interpretations, so they are available for consideration.
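The way evidence from unambiguous reads can shift the probability of an ambiguous read is illustrated below with a textbook expectation-maximization (EM) fit of a simple mixture model. This is a sketch under strong assumptions, not the Premonition pipeline: the per-read likelihoods are hypothetical inputs (in practice they would be derived from alignments), the uniform-coverage heuristic mentioned in the Methods is omitted, and the function name `em_mixture` is invented for this example.

```python
from typing import List


def em_mixture(likelihoods: List[List[float]], n_iter: int = 50) -> List[List[float]]:
    """Fit mixture weights by EM and return per-read genome posteriors.

    likelihoods[r][g] is an assumed likelihood of read r given genome g.
    """
    n_reads = len(likelihoods)
    n_genomes = len(likelihoods[0])
    weights = [1.0 / n_genomes] * n_genomes  # start from a uniform mixture
    posteriors: List[List[float]] = []
    for _ in range(n_iter):
        # E-step: posterior of each genome for each read under current weights.
        posteriors = []
        for row in likelihoods:
            joint = [w * lik for w, lik in zip(weights, row)]
            total = sum(joint)
            if total == 0.0:
                raise ValueError("read has zero likelihood under every genome")
            posteriors.append([j / total for j in joint])
        # M-step: a genome's weight is its expected share of the reads.
        weights = [sum(p[g] for p in posteriors) / n_reads for g in range(n_genomes)]
    return posteriors


# Columns: [genome A, genome B]. Three reads clearly favor A; the last is ambiguous.
toy_likelihoods = [[0.9, 0.1], [0.9, 0.1], [0.9, 0.1], [0.5, 0.5]]
toy_posteriors = em_mixture(toy_likelihoods)
ambiguous_read = toy_posteriors[3]
```

In this toy case the ambiguous read, which on its own is split evenly between the two genomes, is pulled toward genome A because the other reads in the sample support A.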

Various quantities can be derived from this statistical model. For simplicity, we will consider only a few of these quantities here:

For every read \(r\), the genome posterior probability \(p(g \mid r, S, Ref)\) gives the probability that genome \(g\) contributed read \(r\) to the sample. These probabilities sum to one for each read, so they can also be treated as fractionally mapping reads to genomes. For instance, a read \(r\) may have \(p(\text{An. gambiae} \mid r, S, Ref)=0.7\) and \(p(\text{An. coluzzii} \mid r, S, Ref)=0.3\), with all other genomes having zero probability, indicating \(r\) was more likely to have come from the An. gambiae genome than the An. coluzzii genome under modeling assumptions.
The expected number of genome reads is the expected number of reads a genome \(g\) contributed to a sample and is the sum of all fractional reads mapped to that genome, i.e. \(E_{reads}[g]=\sum_{r\in S}p(g \mid r, S, Ref)\). This can be extended to the expected number of reads contributed by an arbitrary taxon. Given a taxon \(tx\), let \(g(tx)=\{g_{tx,1},\dots,g_{tx,k}\}\subseteq Ref\) be the set of all genomes in the reference database that belong to that taxon. Then, the expected number of reads for that taxon is \(E_{reads}[tx]=\sum_{g\in g(tx)}E_{reads}[g]\). For example, \(E_{reads}[\text{diptera}]-E_{reads}[\text{Anopheles}]\) gives the expected number of reads contributed from non-anopheline dipterans.
The \(\epsilon\)-genome coverage and \(\epsilon\)-genome percent identity of a genome \(g\) are the genome coverage and genome percent nucleotide identity calculated using the set of all reads with genome posterior probability greater than or equal to \(\epsilon\), i.e. \(\{r\in S \mid p(g \mid r, S, Ref)\ge \epsilon\}\). Choosing a smaller value for \(\epsilon\) yields a higher coverage because more reads are considered, but a lower value for percent identity because more divergent reads are included. Posterior credible intervals could also be defined. For this presentation, “x% of reads were placed in taxon” means the percentage of expected reads contributed by that taxon relative to the total number of reads in that sample, i.e. \(100\times \frac{E_{reads}[tx]}{|S|}\).
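The three quantities defined above can be computed directly from a table of per-read posteriors. The Python sketch below assumes each read is stored as a dictionary from genome name to posterior probability; this layout, the function names and the toy sample are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Set

# One read: genome name -> p(g | r, S, Ref); genomes not listed have probability 0.
Posterior = Dict[str, float]


def expected_reads(sample: List[Posterior], genome: str) -> float:
    """E_reads[g]: sum of the fractional reads mapped to one genome."""
    return sum(read.get(genome, 0.0) for read in sample)


def expected_reads_taxon(sample: List[Posterior], genomes_in_taxon: Set[str]) -> float:
    """E_reads[tx]: sum of E_reads[g] over all reference genomes in the taxon."""
    return sum(expected_reads(sample, g) for g in genomes_in_taxon)


def epsilon_reads(sample: List[Posterior], genome: str, eps: float) -> List[int]:
    """Indices of reads whose posterior for `genome` is at least eps."""
    return [i for i, read in enumerate(sample) if read.get(genome, 0.0) >= eps]


def percent_placed(sample: List[Posterior], genomes_in_taxon: Set[str]) -> float:
    """100 * E_reads[tx] / |S|."""
    return 100.0 * expected_reads_taxon(sample, genomes_in_taxon) / len(sample)


# Toy sample of four reads.
toy_sample: List[Posterior] = [
    {"An. gambiae": 0.7, "An. coluzzii": 0.3},
    {"An. gambiae": 1.0},
    {"Homo sapiens": 1.0},
    {"An. coluzzii": 0.6, "An. gambiae": 0.4},
]
anopheles = {"An. gambiae", "An. coluzzii"}
gambiae_expected = expected_reads(toy_sample, "An. gambiae")
anopheles_percent = percent_placed(toy_sample, anopheles)
```

Lowering `eps` in `epsilon_reads` enlarges the read set used for coverage, which mirrors the trade-off between coverage and percent identity described above.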

Metagenomic Analyses

The Bayesian Mixture Model (BMM) pipeline was applied to DNA sequencing datasets of 1,142 mosquitoes from the Ag1000G Phase 1 and 2 datasets.[15, 16] All Phase 1 and Phase 2 reads were processed as follows. First, all reads were deduplicated by searching for exact and exact reverse-complement duplicates. The duplicity count for each read was recorded. Adapters were trimmed with Cutadapt v1.13.[40] Low-quality and low-complexity reads were removed with PrinSeq v0.20.4.[41] To reduce computational complexity, reads that aligned to mosquito references with an edit distance of five or better were subsampled at a rate of 1.0%. The references for subsampling consisted of: An. gambiae (g4 assembly; GCA_000150785.1), An. coluzzii (m5 assembly; GCA_000150765.1) and An. gambiae str. PEST (AgamP3 assembly; GCF_000005575.2).[42, 43] All reads were aligned with SNAP-aligner with an edit distance limit of up to 20 against a selection of RefSeq and GenBank assemblies (615,026 total accessions retrieved June 2018).[30] The selection aimed to have at least one high-quality assembly for every species taxonomic identifier. All viral references, also retrieved from NCBI’s GenBank in June 2018, were included. A metagenomic assignment of reads to accessions was computed based on a BMM, implemented as an Expectation-Maximization (EM) algorithm and extended with a heuristic that prefers accessions with uniform coverage. An accession was assigned to each read, producing a probabilistic BMM call that reveals the most likely taxonomic assignment.
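The first and fourth preprocessing steps (deduplication with a recorded duplicity count, and subsampling) can be sketched in Python as follows. This is an illustration, not the pipeline's code: the canonical-form trick for detecting reverse-complement duplicates, the function names and the fixed random seed are assumptions, and the sketch ignores the alignment step that decides which reads are eligible for subsampling.

```python
import random
from collections import Counter
from typing import Dict, Iterable, List

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA string over the alphabet A, C, G, T, N."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(seq: str) -> str:
    """Pick one representative for a read and its reverse complement."""
    seq = seq.upper()
    return min(seq, reverse_complement(seq))


def deduplicate(reads: Iterable[str]) -> Dict[str, int]:
    """Collapse exact and exact reverse-complement duplicates.

    Returns canonical read -> duplicity count.
    """
    return dict(Counter(canonical(read) for read in reads))


def subsample(reads: Iterable[str], rate: float = 0.01, seed: int = 0) -> List[str]:
    """Keep each read independently with probability `rate` (1.0% in the text)."""
    rng = random.Random(seed)
    return [read for read in reads if rng.random() < rate]


# Toy reads: "CGTT" is the reverse complement of "AACG".
toy_reads = ["AACG", "CGTT", "AACG", "GGGA"]
duplicity = deduplicate(toy_reads)
```

Recording the duplicity count keeps the information needed to weight each unique read later, even though only one copy is carried through alignment.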

Integrator Pipeline

For all Ag1000G samples we applied these steps: 1) All reads classified by BMM as bacterial or unaligned were assembled with SPAdes v3.14.[44] Contigs with a minimum length of 2,000 bp were retained as probable bacterial contigs. 2) Probable bacterial contigs were aligned with the Diamond v0.9.24.125 aligner against the RefSeq non-redundant (nr) protein database.[45] All Diamond matches for a given contig were aggregated at the desired taxonomic level. 3) The taxon with the highest integral of percent identity over contig length was assigned to each contig, resulting in an Integrator assignment for a probable bacterial contig. Steps one through three were repeated for the set of BMM viral and unaligned reads to produce an Integrator assignment for each probable viral or bacteriophage contig.
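Step 3 can be read as a scoring rule over the alignment hits on one contig. The Python sketch below is one plausible interpretation, not the Integrator implementation: it assumes hits are given in contig nucleotide coordinates, that overlapping hits of the same taxon contribute the maximum percent identity at each position, and that the integral is the sum of that per-position value over the contig. The `Hit` type and function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Hit(NamedTuple):
    taxon: str     # taxon label after aggregation to the desired level
    start: int     # 0-based start on the contig (inclusive)
    end: int       # end on the contig (exclusive)
    pident: float  # percent identity of the match


def taxon_scores(hits: List[Hit], contig_length: int) -> Dict[str, float]:
    """Integrate per-position best percent identity over the contig, per taxon."""
    tracks: Dict[str, List[float]] = defaultdict(lambda: [0.0] * contig_length)
    for hit in hits:
        track = tracks[hit.taxon]
        for pos in range(max(hit.start, 0), min(hit.end, contig_length)):
            track[pos] = max(track[pos], hit.pident)
    return {taxon: sum(track) for taxon, track in tracks.items()}


def assign_contig(hits: List[Hit], contig_length: int) -> Tuple[str, float]:
    """Return the taxon with the highest integrated percent identity."""
    scores = taxon_scores(hits, contig_length)
    if not scores:
        raise ValueError("contig has no hits")
    best = max(scores, key=scores.get)
    return best, scores[best]


# Toy contig of 2,000 bp with hits to two bacterial families.
toy_hits = [
    Hit("Enterobacteriaceae", 0, 1000, 90.0),
    Hit("Enterobacteriaceae", 500, 1500, 80.0),
    Hit("Flavobacteriaceae", 0, 600, 99.0),
]
best_taxon = assign_contig(toy_hits, 2000)
```

Under this reading, a taxon with moderate identity across most of the contig can outrank a taxon with a short, near-perfect match, which favors assignments supported along the whole contig.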

DATA AVAILABILITY STATEMENT

The datasets analyzed in the current study are available via the Anopheles gambiae 1000 Genomes Consortium website:

Ag1000G phase 1 AR3.1 data release: https://www.malariagen.net/data/ag1000g-phase1-ar3.1
Ag1000G phase 2 AR1 data release: http://www.malariagen.net/data/ag1000g-phase2-ar1

All data generated during this study are included in this published article (and its Supplementary Information files).

SOFTWARE CODE AVAILABILITY STATEMENT

The Microsoft Premonition pipeline is a proprietary cloud service, and APIs to access this service are made available to select partners through the Microsoft Premonition Early Access Program (terms and conditions apply, see http://microsoft.com/premonition for details). If needed to assist reviewers, authors will provide the computed mixture model statistics, aggregated at the sample level as a data artifact upon request. Access to read-level data is managed by the Ag1000 Genome Consortium (terms and conditions apply) and should be requested directly from the Ag1000 Genome Consortium.

ACKNOWLEDGEMENTS

The authors gratefully acknowledge the Ag1000G Consortium for making the Phase 1 and 2 datasets publicly available. We are indebted to consortium members Mara Lawniczak and Alistair Miles for their valuable comments and guidance. We also thank Christian Gauthier and Renee Ali for their insightful comments.

AUTHOR CONTRIBUTIONS

AP: Analyzed data relating to Anopheles gambiae mosquitoes and vertebrate hosts. Contributed to overall structure of the study and co-author of initial draft.

MRR: Analyzed data relating to Anopheles gambiae mosquitoes and vertebrate hosts and contributed to overall structure of the study and co-author of initial draft. Corresponding author.

XC: Contributed to the generation of figures and co-author of initial draft.

IH: Contributed to overall structure of the study and co-author of initial draft.

JD: Analyzed data relating to Plasmodium and generation of figures.

MEG: Contributed to interpretation relating to Anopheles gambiae mosquitoes and vertebrate hosts.

GC: Analyzed data relating to Plasmodium and generation of figures. Contributed to overall structure of the study and co-author of initial draft.

DEN: Analyzed data relating to Anopheles gambiae mosquitoes and vertebrate hosts. Contributed to overall structure of the study and co-author of initial draft.

JMP: Analyzed data relating to viruses, bacteriophage, and bacteria. Contributed to overall structure of the study and co-author of initial draft.

EKJ: Contributed to overall structure of the study and co-author of initial draft.

ADDITIONAL FUNDING INFORMATION

Authors AP, XC, MRR, IH and EKJ are paid employees of Microsoft Corporation. The funder provided support in the form of salaries for authors [IH, AP, XC, MRR, EKJ], but did not have any role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript. JD and GC were supported by funding from the Department of Biological Sciences at Purdue University. DEN and MEG were supported in part by the Johns Hopkins Malaria Research Institute and Bloomberg Philanthropies. Commercial funder Microsoft Research provided financial support to JMP, MEG and DEN.

COMPETING INTERESTS STATEMENT

Authors AP, XC, MRR, IH and EKJ are current, salaried employees of Microsoft Corporation. This does not alter our adherence to Nature Communications' policies on sharing data and materials. All authors declare no conflict of interest.

Table 1

Total number and percentage of reads recovered by the BMM pipeline, by taxonomic group.
Taxonomic group	Number of reads	% of total reads
Arthropoda	140,371,534,239	93.0%
Chordata	4,713,450,496	3.1%
Bacteria	834,882,394	0.6%
Plasmodium	5,344,273	0.0%
Viruses	2,040,225	0.0%
Plantae, Fungi, Other	24,739,773	0.0%
Not clean*	3,655,091,803	2.4%
Unaligned	1,299,025,854	0.9%
Total	150,900,764,783	100.0%
* Flagged and removed by PRINSEQ or Cutadapt
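
The percentages in Table 1 can be reproduced from the read counts with simple arithmetic. The sketch below is illustrative only and is not part of the BMM pipeline. It assumes that each percentage is the category's read count divided by the stated total and rounded to one decimal place. The names READ_COUNTS, STATED_TOTAL and percent_of_total are hypothetical and chosen for this example.

```python
# Read counts copied from Table 1. All names here are illustrative,
# not taken from the BMM pipeline.
READ_COUNTS = {
    "Arthropoda": 140_371_534_239,
    "Chordata": 4_713_450_496,
    "Bacteria": 834_882_394,
    "Plasmodium": 5_344_273,
    "Viruses": 2_040_225,
    "Plantae, Fungi, Other": 24_739_773,
    "Not clean": 3_655_091_803,
    "Unaligned": 1_299_025_854,
}

# Total as stated in the last row of Table 1 (assumed denominator).
STATED_TOTAL = 150_900_764_783


def percent_of_total(count: int, total: int = STATED_TOTAL) -> str:
    """Return count as a percentage of total, formatted to one decimal place."""
    return f"{100 * count / total:.1f}%"


# Percentage string for every category, in table order.
PERCENTAGES = {group: percent_of_total(n) for group, n in READ_COUNTS.items()}

if __name__ == "__main__":
    for group, pct in PERCENTAGES.items():
        print(f"{group}\t{READ_COUNTS[group]:,}\t{pct}")
```

At one decimal place, Plasmodium, viral, and plant/fungal/other reads all round to 0.0% even though each category contains millions of reads. As listed, the eight category counts sum to 5,344,274 more than the stated total, so the sketch uses the stated total as the denominator rather than the sum of the rows.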

Table 2

Percentage of vertebrate (phylum Chordata) host reads (n = 4,713,450,496) identified in mosquito specimens, by family.
Family (Genus)	% of total Chordata reads
Hominidae (Homo)	82.3%
Bovidae (Bos, Capra)	13.7%
Canidae (Canis)	1.2%
Equidae (Equus)	2.1%
Phasianidae (Gallus)	0.7%
Total	100.0%

Table 3

Number of specimens containing vertebrate host reads.
Family (Genus)	# of specimens containing Chordata reads*
Hominidae (Homo)	138
Bovidae (Bos, Capra)	16
Canidae (Canis)	5
Equidae (Equus)	2
Phasianidae (Gallus)	1
Total	162
* ≥25% coverage
