Metagenome-validated Parallel Amplicon Sequencing and Text Mining-based Annotations for Simultaneous Proling of Bacteria and Fungi: Vaginal Microbiome and Mycobiota in Healthy Women

Background: Amplicon sequencing of kingdom-specic tags such as 16S rRNA gene for bacteria and internal transcribed spacer (ITS) region for fungi are widely used for investigating microbial populations. So far most human studies have focused on bacteria while studies on host-associated fungi in health and disease have only recently started to accumulate. To enable cost-effective parallel analysis of bacterial and fungal communities in human and environmental samples, we developed a method where 16S rRNA gene and ITS-1 amplicons were pooled together for a single Illumina MiSeq or HiSeq run and analysed after primer-based segregation. Taxonomic assignments were performed with Blast in combination with an iterative text-extraction based ltration approach, which uses extensive literature records from public databases to select the most probable hits that were further validated by shotgun metagenomic sequencing. Results: Using 50 vaginal samples, we show that the combined run provides comparable results on bacterial composition and diversity to conventional 16S rRNA gene amplicon sequencing. The text-extraction-based taxonomic assignment guided tool provided ecosystem specic annotations that were conrmed by Metagenomic Phylogenetic Analysis (MetaPhlAn). The metagenome analysis revealed distinct functional differences between the bacterial community types while fungi were undetected, despite being identied in all samples based on ITS amplicons. Co-abundance analysis of bacteria and fungi did not show strong between-kingdom correlations within the vaginal ecosystem of healthy women. Conclusion: Combined amplicon sequencing for bacteria and fungi provides a simple and cost-effective method for simultaneous analysis of microbiota and mycobiota within the same samples. Text extraction-based annotation tool facilitates the characterization and interpretation of dened microbial communities from rapidly accumulating sequencing and metadata readily available through public databases. vaginal specic bacteria, while there are only 1,458 entries for fungi, highlighting the disparity. We show that a


Introduction
Pro ling of bacterial communities using 16S rRNA gene amplicon sequencing is the standard tool for analysing the composition and diversity of microbiota in human and environmental samples. The fungal component of the microbiota, also called mycobiota, has a long history of research in mycorrhizal fungi and other environmental samples, but has only recently gained interest as part of human commensal microbiota. Based on recent molecular studies, there are an estimated 10 12 − 10 13 fungi compared to 10 13 − 10 14 bacteria in the human microbiota, across the gastrointestinal tract, oral cavity, vaginal mucosa, and skin [1]. The most abundant fungi colonizing humans belong to phyla Ascomycota and Basidiomycota. The common fungal genera found in the human mouth, gut, skin, lungs and vagina in healthy individuals as well as in selected diseases has been extensively reviewed elsewhere [1,2]. Fungi can interact with the host as well as the rest of the microbiota in both mutualistic and antagonistic ways, highlighting their relevance for human health [3][4][5][6]. For the vaginal ecosystem, the bacterial composition is well described. For most healthy women, it predominantly consists of Lactobacillus species, namely L. crispatus, L. iners, L. gasseri, or L. jensenii, while the presence of Gardnerella vaginalis and an assortment of other, typically anaerobic species is indicative of dysbiosis and prevail e.g. in bacterial vaginosis (BV) [7]. The fungal component of the vaginal ecosystem is poorly characterized, as after the pioneering research of Drell et al. in 2013, where they characterized the vaginal mycobiota and microbiota of 494 women, progress has been slow [8]. Only a handful of original papers have been published on the vaginal mycobiota, yet it has been associated with various diseases, most obviously yeast infections, but also with vulvodynia, bacterial dysbiosis, and cervical dysplasia [9,10].
Despite the clear biological relevance of fungi and interest for cross-kingdom studies, there are limited tools available for such analyses. While metagenomics, i.e. whole genome shotgun sequencing (WGS) in principle allows sequencing of entire microbial communities, in practice the very low relative abundance of fungi (e.g. in the vagina in the range of 0.17 ± 0.04%, if detected at all [11] and in the gut 0.01% [12]), which is a hindrance for a cost-effective analysis. There is a rapidly growing number of studies that have processed and sequenced both 16S rRNA gene and ITS amplicons separately for parallel characterization of fungal and bacterial communities in the same samples. In biomedical research, these include studies addressing the impact of sample storage and DNA extraction [13,14], descriptive compositional analyses of human samples in cross-sectional [8] and longitudinal settings [9], and their correlation to (disease) phenotypes [15][16][17][18].
There is an increasing demand for affordable and robust methods to study dominant members of eukaryotic microbes that constitute the human microbiome i.e. fungi, alongside bacteria. To address this, we developed a parallel amplicon sequencing technique that combines bacterial 16S rRNA gene and the fungal ITS region in a single Illumina sequencing run. This study aims to determine the feasibility and performance of this method by applying it to vaginal samples from 50 healthy women and characterizing their bacterial and fungal pro les, with shotgun metagenomics as a reference. In addition, we also attempt to assign species level annotations from amplicon sequencing data using a custom text miningbased ltration approach.

Sample collection and processing
We sampled 50 non-pregnant Caucasian women aged 25-45 years attending population-based cervical cancer screening as previously described [19]. The exclusion criteria were vaginal intercourse 48 hours prior to sampling, pregnancy, previous hysterectomy, and the inability to tell the time of last menstrual period. Detailed subject characteristics have been previously reported [19]. All samples were collected with a sterile ocked swab (FLOQSwabs, Copan spa, Italy) from the right fornix of the vagina during a speculum exam. DNA extraction was done using bead beating as described [19]. The extracted DNA was stored at -20°C.
Sequencing of 16S rRNA gene and Internal transcribed spacer-1 (ITS-1) amplicons Data generated from an earlier study [19] that targeted the V3-V4 region of the 16S rRNA gene was used as a control dataset (PRJEB25778) to determine any data loss or variation that might be brought on by parallel sequencing. This sequencing run was done using paired end mode (PE250 kit) and Illumina HiSeq as detailed previously [19]. For the parallel sequencing approach, the 16S rRNA gene and ITS amplicons were generated separately from the previously extracted frozen DNA and pooled for barcoding.
For fungi, a two-step PCR was applied due to the poor ampli cation when using amplicon-speci c primers with overhang adapters. PCR-amplicons of the ITS-1 region were generated using ITS1F and ITS2 primers (5′-GGTCATTTAGAGGAAGTAA-3′ and 5′-GCTGCGTTCTTCATCGATGC-3′ [22,23]. The primers for the second PCR to add the adapters were adapted from [24]. Both ITS PCR reactions were carried out in a total reaction volume of 20 µl consisting of 10 µl of 1X Phusion® Master Mix, 4.4 µl of the template, 0.5 µl of each primer, 0.6 µl of DMSO, with water added to reach the nal volume. Template concentration at the rst ITS PCR was 22 ng of sample DNA and at the second the amplicon derived from the previous PCR step was used. The cycling parameters for both ITS PCRs were 98°C for 60 s, followed by 44 cycles of 98°C for 10 s, 58°C for 40 s, 72°C for 40 s and nally 10 min at 72°C, after which the samples were stored at 4°C. Both 16S rRNA gene and ITS PCR products were visualized using agarose gel electrophoresis and then puri ed with AMPure XP beads (Beckman Coulter, Copenhagen, Denmark). After puri cation, both PCR products were quanti ed using Quant-it PicoGreen dsDNA assay kit, 2000 assays (Invitrogen, cat.nro P7589). The index PCR was performed to add the barcoded primers [24] for sequencing, which was done using the following settings: 98°C for 60 s, 30 cycles of 98°C for 10 s, 64°C for 30 s, 72°C for 30 s and nally 10 min at 72°C. The 16S rRNA gene amplicons were diluted to 40 nM and the two locus-speci c PCR products were mixed 1:1 for volume to be used as a template. The PCR ampli cation was performed with 4 µl template, 10 µl 1X Phusion® Master Mix, 0.5 µM each barcoded primer, 0.6 µl DMSO, with water added to reach the nal volume of 20 µl.
A 50 nM pool was prepared, and its clean-up was performed with AMPure XP beads (Beckman Coulter, Copenhagen, Denmark). The pooled 16S rRNA gene V3-V4-ITS amplicon mixture was sequenced at the Biomedicum Functional Genomics Unit (FuGU), Helsinki, Finland with an Illumina MiSeq instrument using paired end 2 × 300 bp reads and a MiSeq v3 reagent kit. The loading concentration was 7.5 pM with 15% PhiX spike-in.
Pre-processing and splitting of 16S rRNA gene and ITS reads First, we removed reads with ambiguous (Ns) bases and used cutadapt version 2.6 to trim the primers from raw reads and to split them into separate 16S rRNA gene and ITS amplicon datasets (fastq-les) with "discard_untrimmed" option [25]. With this method the 16S rRNA gene or ITS library could be extracted from the mixed library and analysed independently. We used the same pre-processing steps for both sequencing runs including the library split. After the library split, removal of primers, and "Filter and Trim", the two 16S rRNA gene libraries were pre-processed together with the DADA2 package version 1.12.1 in R version 3.6.1, following the DADA2 Pipeline Tutorial (1.12) [26][27][28]. The ITS library was preprocessed as in DADA2 ITS Pipeline Work ow (1.8) except the removal of primers and ambiguous bases was already done during the library split [29].

Taxonomic assignment
The vaginal microbiota in most women consist of one or more genomically and functionally distinct Lactobacillus species that drive the well-established categorical vaginal community state types and determine the various compositions of bacteria and fungi [30,31]. To study the composition of this ecosystem it was rendered necessary to be able to assign taxonomies at species level resolution. These assignments can be both di cult and unreliable when sequence homology is the only basis for distinction, as is the case with popular tools and databases such as SILVA [32], RDP [33], and GreenGenes [34]. This is primarily due to the limited sequence information contained within the individual hypervariable regions of the 16S rRNA gene/ITS region, as well as due to incomplete databases that also contain reported mis-annotations [35,36]. For accurate annotations, our requirements were 1) an extensive and up-to-date reference database that can capture the largest number of taxa, and 2) background information that can be utilized to determine the speci c ecosystem that each taxon has been isolated from to support the choice of the most probable hit for species that are poorly or completely indistinguishable in terms of sequence homology alone. Aligning the sequences using NCBI BLAST against the nucleotide (nt) database ful lled the rst requirement. The tabular output obtained as a result (outfmt 6), allowed us to extract accession IDs for each hit, which serve as keys to NCBI's extensive network of databases that were accessed through the rentrez R package [37]. We extracted information such as isolation source, host, and publications (titles and abstracts) associated to each hit and concatenated this information into one large block of text. Next, we constructed word queries (collectively termed "word banks", Figure 1), that consist of lists of words specifying the ecosystem of interest. These word banks were the basis for ltration, searching for key words within this block of text to determine the most likely annotations. As an example, to obtain annotations for the "Human vagina", a word bank would consist of the following words: "human", "homo sapiens", "woman", "women", "vagina", "vaginal", "cervicovaginal", "female genital tract", "vaginitis" etc. For complete contents of each word bank see Additional le 3. This tool ("taxminer") can be exibly used to stack different sets of word banks in succession to gain the maximum number of annotations between interrelated ecosystems, resulting in varying degrees of relevance and con dence. This trickle-down approach prevents oversaturated entries from masking real results in a pool of multiple hits and apply a hierarchy between different ecosystems relevant for the study. In our study, we stacked the vaginal, gut + skin, and clinical word banks as shown in Figure 1. To ensure the veracity of the tool, the selected annotations were manually veri ed and con rmed. While we acknowledge that this still does not provide a complete solution to species-level taxonomic annotations, which is an inherent problem with 16S rRNA gene amplicon sequencing, this method allows for a customized search strategy to obtain the mostly likely knowledge-based results. By utilizing a rich resource of information, we can intuitively select the best hits and weed out mis-annotations, both of which would prove to be a signi cant challenge otherwise.

Statistical methods
The R package vegan (2.5-7) was used to calculate: 1) alpha diversity (Shannon), rarefaction, and 2) permutational multivariate analysis of variance (PERMANOVA) through the "adonis" function to quantify the variance explained by the sequencing run type and other technical as well as clinical and other host variables on the bacterial and fungal composition within the samples [30]. Principal coordinate analysis was done using R packages phyloseq (1.34.0) and vegan, and the gures were plotted with ggplot2 (3.3.3) [38]. All beta-diversity measures were based on Bray-Curtis dissimilarity. To classify the samples into community state types (CSTs) based on the microbial compositions, we used a recently developed tool called VALENCIA (VAginaL community state typE Nearest CentroId clAssi er) [31]. All bar plots are created using ggplot2 and heatmaps are created with pheatmap [32,33]. Correlation analysis between bacteria and fungi was done using Spearman correlation using rcorr.

Shotgun metagenomic sequencing
To validate the performance of the combined 16S rRNA gene+ITS amplicon sequencing, shotgun metagenomic sequencing was carried out on a subset of 21 samples that had su ciently DNA left (min 300 ng). The libraries were prepared using Illumina Nextera™ DNA Flex Library Prep Kit and sequenced on Illumina's NovaSeq6000 S4 PE150 run. [39]. To evaluate the accuracy of the text mining-based annotation method on the 16S rRNA gene and ITS amplicon data, the metagenome data was analysed with MetaPhlAn 3.0 [40]. We also performed a kingdom level search with Kraken2 (version 2.0.8) and RefSeq bacterial, archaea, viral, fungi and protozoa database (krak_microb) [39]. Finally, the HUMAnN 3.0 pipeline [41] was used to obtain the functional pro les for each sample. The reads per kilobase (RPK) values contained within the pathway abundance output le were converted into counts per million (CPM), using the renormalizing utility ("humann_renorm_table"). The gene families output les provided by the HUMAnN pipeline were regrouped into level 4 EC categories using the regrouping utility ("humann_regroup_table"), to visualize the enzyme level metabolic potential of the microbes. These were also renormalized and combined with the pathway abundance les using the unpacking utility ("humann_unpack_pathways"). For the heatmap(s) of functional pro les, complete linkage clustering was applied on both rows and columns and the CPM values mapped were log transformed. Further information associated to each EC number such as enzyme name, reaction, substrate etc was extracted using KEGGREST [42].

Results
High concordance for bacterial data between combined 16S rRNA gene+ITS versus conventional 16S rRNA gene amplicon sequencing The main aim of this study was to determine the feasibility and performance of a combined method of sequencing for bacterial 16S rRNA gene and fungal ITS region amplicons (hereafter also referred as combined run). Since the 50 vaginal samples were already analysed in a previous publication for the bacterial compositions, the existing 16S rRNA gene amplicon data (also referred to as reference run) and the results obtained therein serve as a point of reference [19]. To make the previous and current runs comparable, the outputs were analysed together through the same pre-processing and annotations pipeline as described in Methods. The 16S rRNA gene+ITS run (hereafter referred also as combined run for short), produced a mean of 120,186 reads per sample (for 16S rRNA gene 38,093 and ITS 82,423 reads per sample) on high read quality (quality pro les Additional le 1: Supplement Fig. 1). Details of both sequencing runs can be found in the Table 1. Rarefaction curves of the combined sequencing show that the sequencing depth was more than su cient to capture all the species diversity for both amplicon types. Maximum diversity for all samples was detected with ~38,000 and ~32,000 reads, while 75% of the samples reached plateau at ~17,000 and ~600 reads for bacteria and fungi respectively indicating that the samples had been sequenced to su cient depth to capture their true diversity. (1,363 -54,865) * The reads containing ambiguous bases (Ns) were removed before the library split. ** No PCR-amplification for ITS; likely reflecting cross-reactivity of 16S rRNA gene primers with 18S rRNA gene region upstream of ITS1.
In terms of bacterial composition, the pro les from the two runs (16S rRNA gene and 16S rRNA gene+ITS) were highly similar (Pearson's correlation 0.97), vast majority of the samples clustering tightly in Principal coordinate analysis (PCoA) plots ( Figure 2) and negligible variance being explained by the run type in PERMANOVA (R2 = 0.004, p = 0.9). As previously published [19], the most abundant bacteria within the sample were of the genus Lactobacillus, a majority of which were L. iners and L. crispatus (Figure 3; parallel stacked bar plots), both being comparable in terms of prevalence between the runs -L. iners (16S rRNA gene: 25/50, combined: 23/50), L. crispatus (16S rRNA gene: 21/50, combined 19/50). Gardnerella vaginalis, a Bi dobacterium-related species associated with bacterial vaginosis was observed in 17 samples in the 16S rRNA gene-only run and 16 samples in the combined run. Fannyhessea vaginae (formerly known as Atopbium vaginae) was seen to mostly co-occur with G. vaginalis. We also observed Clostridiales genomosp. (CP049781.1) in non-Lactobacillus-dominated samples that were seen to contain BV-associated bacteria upon visual inspection. This particular bacterium was previously known as Bacterial vaginosis associated bacteria 1 (BVAB1) and has recently been renamed as "Candidatus Lachnocurva vaginae" [36]. When the ASVs for this microbe were annotated against the silva database, they were assigned as Shuttleworthia, which is a known mis-annotation for these dedicated databases [35,36]. For a full list of BLAST vs silva vs RDP annotations for the ASVs see Additional le 4. The bacterial pro les were also assigned to CSTs and subCSTs (Addiitonal le 1: Supplement Fig. 2).
Vaginal mycobiota based on ITS amplicon sequencing Based on the ITS-reads of the combined run, all 50 samples contained fungi (Fig. 4). A total of 146 fungal ASVs were successfully annotated and were further agglomerated by species and ltered based on prevalence ≥ 2 of the samples and abundance > 100 reads. This resulted in 23 unique fungal annotations (Figure 4), with the phylum level distribution between the samples as follows: Ascomycota (47.8%, 11/23), Basidiomycota (43.5%, 10/23), and unclassi ed (8.7%, 2/23). The most abundant phylum within the samples were the Ascomycota, accounting for 83.6% of the total reads, with the most abundant species being Candida albicans (49.6%), Alternaria alternata (13.2%), Cladosporium species (10%), Epicoccum nigrum (8.3%), and Saccharomyces cerevisiae (1.1%) (Figure 4 In PERMANOVA analysis, among the 29 tested technical and host variables, the number of lifetime sex partners was most strongly associated to the fungal composition (R2 = 0.07, p = 0.0003), while also history of BV (R2 = 0.04, p = 0.04) and the day of menstrual cycle at sampling reached statistical signi cance (R2 =0.03, p = 0.047). To determine the co-occurrence patterns between vaginal bacteria and fungi, the abundance pro les of microbes that occurred in at least 5 samples were correlated and visualized ( Figure. 5). We observed that there were no signi cant correlations between the dominant lactobacilli -L. crispatus and L. iners -and fungi. However, L. acidophilus and L. jensenii were positively correlated to M. restricta while L. reuteri was seen to be positively correlated to an uncultured Peniophora. Additionally, M. restricta was negatively correlated with Prevotella spp. and Sneathia amnii. Other correlations of note were the negative correlation between G. vaginalis and C. allicinum and the positive correlation between C. albicans and Streptococcus anginosus.
Metagenomics results in relation to the 16S rRNA gene and parallel 16S rRNA gene+ITS amplicon sequencing To assess the performance and coverage of the parallel 16S rRNA gene+ITS sequencing, we compared the taxonomic outputs to metagenomic shotgun sequencing. We also needed to benchmark the textmining based annotation pipeline against an independent method of species recognition. The 21 libraries consisted of total 557,102,162 read pairs 2x150 (per sample average 26,528,674; min 10,276,162; max 46,160,072). We found no falsely annotated species among abundant and well-known (cultured) species with our 16S rRNA gene annotation pipeline compared to clade-speci c metagenomic species annotation (MetaPhlAn). Bacterial compositions were highly similar between the amplicon and metagenomic sequencing ( Figure 6, Additional le 1: Supplement Fig. 5). There was some variation in G. vaginalis detection between amplicon sequencing and metagenomics, however: G. vaginalis was not detected at all with metagenome in two samples (1216 and 1261) and was slightly more abundant in the remaining samples when detected. Overall, there was high concordance for the bacterial compositions across all three sequencing runs (Pearson's correlation for combined vs WGS 0.93, PERMANOVA for all three runs R2 = 0.007, p = 0.88).
Fungi could not be detected through metagenomics either with MetaPhlAn or Kraken2, which use cladespeci c marker genes and k-mer compositions for taxonomic assignments, respectively. In kingdom level analysis the yield for fungal reads was only 0.08-0.17% and the fungal community compositions could not be determined except S. cerevisiae was detected in a single sample (ID: 1192) through MetaPhlAn.

Functional attributes of the vaginal microbiota
After removal of the human reads, functional pro ling with the HUMAnN pipeline identi ed 104 metabolic pathways and 381 enzymes, revealing the metabolic potential of the bacteria present within these samples (n=21). Hierarchal clustering of the functional pro les successfully rearranged and regrouped samples with similar bacterial compositions and resultant CSTs (Figure 7), suggesting that there are identi able differences in the metabolic potential of the typical bacterial populations in the vagina. For a full list of individual pathways, their enzymes, and EC numbers see Additional le 5.
The rst cluster contained pathways that are essential to life and were observed across all samples. This included coenzyme A biosynthesis (COA-PWY), tRNA charging (TRNA-CHARGING-PWY), which consists of a group of 19 ligases responsible for amino acid activation, and cell wall biosynthesis pathways, mainly UDP-N-acetyl-D-glucosamine (UDPNAGSYN-PWY) and peptidoglycan biosynthesis (PEPTIDOGLYCANSYN-PWY). Pyruvate fermentation to lactate and acetate (PWY-5100) was largely observed in samples containing lactobacilli, highlighting their capacity to produce lactic acid.
A second cluster, representing pathways observed in samples containing L. crispatus (or CST I), consisted of amino acid biosynthesis, avin biosynthesis, folate transformations, amino sugar degradation, and acetylene degradation pathways. There were nine enzymes related to the superpathway of L-lysine, L-threonine and L-methionine biosynthesis (PWY-724), shared between L. crispatus and Bi dobacterium spp. but were absent in L. iners. A further 7 enzymes related to the Llysine biosynthesis pathway (PWY-2941) were observed within this cluster, with pathway variant II, distinguished by the enzyme N-acetyl-diaminopimelate deacetylase (EC 3.5.1.47), exclusively seen in L. crispatus. Two other variants (III & VI) that share the remaining six enzymes were also observed in Bi dobacterium spp., while they were completely absent in L. iners. An amino sugar degradation pathway, superpathway of N-acetylglucosamine (GlcNAc), N-acetylmannosamine (ManNAc) and Nacetylneuraminate (NeuAc) degradation (GLCMANNANAUT-PWY), and acetylene degradation (P161-PWY) were also observed to be exclusive to L. crispatus.
Another group of sparsely distributed pathways was observed to be speci c to samples containing L. iners (or CST III), which included CDP-diacylglycerol biosynthesis (PWY-5667), 6-hydroxymethyldihydropterin diphosphate biosynthesis (PWY-6147), and six enzymes from the mevalonate pathway (PWY-922). The last cluster of interest consisted of pathways observed only in samples containing Bi dobacterium spp. or G. vaginalis (CST IV C-3 and CST IV B). This included two pathways for the synthesis of inosine monophosphate (IMP); a super pathway of 5-aminoimidazole ribonucleotide biosynthesis (PWY-6277) and an inosine-5'-phosphate biosynthesis pathway (PWY-6123).

Discussion
Performance and output of the combined 16S rRNA gene and ITS amplicon sequencing With the rapidly expanding list of microbes being characterized for their role in human health and disease, it is crucial to be able to identify and differentiate between i) commensal microbes that co-exist in a symbiotic state within various niches of interest within the human body and ii) pathogens that induce dysbiosis and drive several clinical conditions. Studies targeting colonization patterns of fungi are particularly scarce, especially for the vaginal ecosystem when compared to bacteria [5,6]. A simple text search performed on the NCBI nucleotide database shows that there are 209,784 entries for vaginal speci c bacteria, while there are only 1,458 entries for fungi, highlighting the disparity. We show that a combined sequencing method that targets the 16S rRNA gene for bacteria and the fungal ITS region is a convenient strategy to study two distinct and clinically relevant microbial kingdoms in parallel without loss of data or incurring notable extra costs.
To our knowledge, this work is the rst to report an amplicon sequencing method to study bacteria and fungi simultaneously within human samples using Illumina sequencing technology. Previously, Kittleman et al. reported a pyrosequencing approach that combined analysis of phylogenetic marker genes for bacteria, archaea and eukaryotic micro-organisms in ruminal samples and Coller et al. combined 16S rRNA gene and fungal ITS1 amplicons to an Illumina MiSeq run to study soil samples [43,44]. In addition, many studies have generated 16S rRNA gene and fungal ITS amplicons in the same samples, but the analyses were done separately [8,9,[13][14][15][16][17][18]. Also, bacterial ITS amplicons have been employed for improved taxonomic resolution in parallel to conventional 16S rRNA gene amplicon sequencing [45,46]. Kits for single library constructions for bacteria and fungi are also commercially available (at least from Swift, Qiagen and Perkin Elmer), but the costs per sample are substantially higher and hence our method provides a cost-effective option for laboratories that process high sample volumes and possess the basic expertise for the index-PCR. In earlier studies where a single Illumina run was used for sequencing of both bacteria and fungi, the index-PCR was performed separately for each type of amplicon and they were only combined for sequencing [43,44]. Hence, to our knowledge we are the rst to report a protocol where the pooling of bacterial and fungal amplicons takes place already after the locus-speci c PCRs, and hence saves time and costs as all downstream steps including index PCR for both amplicon types occur in a single well. Use of the same barcodes for bacterial and fungal amplicons further simpli es the design and improves the e ciency especially when there are limited number of barcodes available.
By combining 16S rRNA gene and ITS amplicon sequencing in a parallel setting we were able to achieve comprehensive community composition data for both bacterial and fungal communities with very little added cost compared to ordinary 16S rRNA gene sequencing. The 16S rRNA gene and ITS libraries can be separated easily with publicly available bioinformatic tools like cutadapt that was used here and was the only additional step in pre-processing. The rarefaction curves indicated that with the chosen input of 177 vaginal samples in a single MiSeq run, both bacterial and fungal species were sequenced to an extent su cient to represent their true diversity (Additional le 1: Supplement Fig. 3). Comparisons to conventional 16S rRNA gene amplicon and metagenomic sequencing demonstrated that the newly developed method produced highly similar results for bacterial compositions and there was a high degree of reproducibility between the results despite some methodological differences (HiSeq vs MiSeq, PE250 vs. PE300 chemistry) as a possible source of variation.
The simultaneous analysis of bacterial and fungal communities from the same samples is not a trivial approach since e cient and unbiased cell disruption is required to retrieve representative nucleic acids for further analysis. We used bead beating method for DNA extraction that has been previously shown to perform well on disrupting the rigid cells walls of fungi [47]. The analysis of fungal components of the microbiota is much more challenging than for bacteria e.g. due to the natural length polymorphism of ITS1 length that may complicate both the actual sequencing and subsequent bioinformatic analyses.
Text-mining based annotation and prioritization of hits Achieving species level taxonomic annotations is often desired and crucial when investigating homogenous ecological niches such as the female genital tract, which is predominantly populated by different species of lactobacilli, rendering it necessary to attempt for identi cation of individual species and their patterns of interaction with the host. Presently, sequence homology is the main criterium used for taxonomic classi cations within amplicon sequencing work ows, which however cannot distinguish between highly similar sequences, resulting in only a fraction of the sequence queries being assigned to species-level. This is also challenging in BLAST-based annotations, which can potentially produce thousands of highly similar alignment results and have signi cant amount of noise that may mask real, more relevant annotations. Conventionally, selecting results from the BLAST output would either involve taking the "top-hit" from a sorted list of annotations or require manual scanning. We show that by utilizing readily available information from public databases the selection of the most probable hit can be facilitated by means of knowledge-based species level ecosystem speci c taxonomic annotations. By using BLAST as the alignment tool of choice and the nucleotide database as the reference to maximize the number of species that are potentially detected, our approach automates this selection process.
Taxonomic annotations are split into a two-tiered process: i) sequence homology is determined by BLAST algorithms, with a threshold set at 98% for both "percentage identity" and "query coverage" and ii) ambiguity between identical alignments is eliminated by introducing background information such as previously identi ed host and source of isolation as a selection criterion to identify the most relevant hit. Hypothetically, given a set of 10 indistinguishable species level annotations, this approach allows us to probabilistically select the 5 th alignment indicated to have been isolated from human vaginal swabs, and disregarding the other 9 that came from diverse sources including chicken crop, beer malt, and giant pandas. The ltration can be strict enough to look for speci c strains of a microbe investigated in an environmental niche, or vague enough to accept anything collected from human samples.
While we demonstrate this tool in combination with BLAST based annotations, it can essentially be used as an added layer to any number of existing tools that can provide a link to NCBI accession IDs or attach previously gathered information for each alignment. By allowing any combination of user de ned text queries, it can be customized for each individual study and enhances existing tools by increasing con dence in the nal taxonomic annotations. In addition, information such as publications, journals, authors, lineage, and strain level resolution are also made available in the process. Strain identi cation is of particular interest as it is needed to identify and study the established biological and functional disparity in microbial strains, as discussed later [11,48,49].
We acknowledge that this approach it limited by the inherent limitations of 16S rRNA gene amplicons on discriminating bacteria to species level so that in less characterized ecosystems even the best guess, based on homology and background information, can be of low con dence. Another limitation may come from data availability, especially in terms of identifying novel species for a particular environment. However, the latter is remedied by the trickle-down application of the selection process, where the most important and most likely annotations are assigned rst, followed by expanding the ltration criteria to include the remaining potentially rare/novel taxa with decreasing con dence.

Microbial pro les
Whole genome shotgun (WGS) and an individual 16S rRNA gene sequencing method validated the bacterial pro les obtained from our parallel sequencing approach. Since the samples were from healthy subjects, they were dominated by lactobacilli, while samples with mixed/BV-like microscopy results [19] were indeed populated with G. vaginalis, F. vaginae, Veillonella spp., uncultured bacteria, and C. genomosp. (BVAB1). Since we report relative abundances, the discrepancy observed in the proportion of G. vaginalis observed in some samples could be attributed to the absence of these uncultured bacteria from the metagenome results, which are otherwise make up to 10% of the total abundance of the samples in question. The inability of metagenome pipeline to detect these microbes is largely due to incomplete reference databases. While we were not speci cally searching for individual strains, the nal annotations obtained through the ltration approach contained 11 different strains of G. vaginalis (see Additional le 1: Supplement Fig. 5), a microbe associated to bacterial vaginosis (BV), which have previously been described to exhibit varying capacities for sialidase (an enzyme responsible for mucus degradation), and beta galactosidase production [50][51][52][53]. Different strains of G. vaginalis have been identi ed with varying degrees of virulence based on the presence/absence of the genetic machinery facilitating adherence, bio lm formation, cytotoxicity, glycan degradation, antibiotic resistance, and displacement of lactobacilli [52]. This is indicative of the tendency of some G. vaginalis strains to colonize a non-BV vaginal environment as a commensal microbe while other strains can exhibit pathogenicity, highlighting the importance of achieving strain level resolution to understand their biological characteristics and as a diagnostic tool. As our samples were from healthy subjects, this could represent an early or intermediate state of dybiosis.
While we obtained high concordance between the bacterial pro les obtained with our combined amplicon sequencing vs. WGS approach, fungi could only be detected with the former. This is due to the underrepresentation of fungal reads within the metagenomes as well as their coverage in the reference databases used for WGS approach. The yield for fungal reads within the metagenomes was 0.08 -0.17% in our study and 0.17±0.04% in a previous large scale vaginal microbiota study [11]. Eukaryotic genomes are also signi cantly larger and much more complex compared to prokaryotes, which further hinders metagenome analysis, especially with a restricted set of reference databases [54]. Apart from S. cerevisae being detected in one sample (ID: 1192) with MetaPhlan, there were no positive hits in the WGS data while 23 unique species were detected across the 50 samples by ITS amplicon sequencing. This shows that even in this era of relatively low-cost shotgun metagenomic sequencing, amplicon sequencing still has the edge when studying complex low abundance communities. With the lack of studies focusing on fungi, deposited fungal sequences with valid human isolation sources are also scarce. In our samples only C. albicans, a well-characterized pathogenic yeast responsible for vulvovaginal candidiasis [5], was identi ed from vaginal samples. The remaining annotations were traced back to either the gut (Aspergillus spp.), skin (Malassezia spp. and Cladosporium spp.), or other clinical isolates (S. cerevisiae, Schizophyllum spp.). Of these annotations, Cladosporium spp. were a possible wildcard because of the uncertainty surrounding their isolation source. On one hand, the metadata for Cladosporium spp. hits stated that they were isolated from clinical human samples and certain species have been associated to human and animal infections. On the other hand, they are also widely recognized as household fungi and are shown to be a consistent environmental contaminant in fungal studies [8,55,56].
Investigating the correlations between the micro-and mycobiota revealed some interesting trends, notably the positive correlation between C. albicans and S. anginosus. Both microbes are individually implicated in negative clinical outcomes and Streptococci spp. have been shown to co-aggregate with C. albicans, which allows it to form a thicker bio lm and obtain antibiotic resistance [57,58]. Prevotella spp. and Malassezia spp. have previously been shown to positively correlate on skin, which opposed our ndings, indicating either a different colonization pattern in this ecological niche or presence of different strains [59]. Positive correlation between L. reuteri and an uncultured fungus of genus Peniophora could potentially be attributed to their shared ability to bind/sequester and degrade a atoxin B 1 , a mycotoxin produced by Aspergillus spp., as a survival mechanism [60][61][62]. We speculate that this might be indicative of an evolutionary characteristic that these microbes possess to inhabit the same environment. Among the Cladosporium species, the only observed correlation was a negative one between C. allicinum and G. vaginalis [63,64]. Due to the scarcity of information on the vaginal fungal communities and their potential associations to clinical and other host factors, we carried out the same PERMANOVA analysis for fungi as previously done for the bacterial microbiota of the same samples [65]. We found number of sex partners (lifetime), period day, history of BV, and antibiotic and probiotic use to be associated with fungal abundance pro les while education was the strongest factor explaining bacterial variation [65]. This divergence suggests that the drivers of colonization patterns for vaginal bacteria and fungi may be independent of each other [65]. For the full PERMANOVA analysis results see Additional le 2.

Functional pro les
Even with a relatively small sample size, functional pro les obtained from the metagenome analysis revealed clear separation of bacterial pro les and their metabolic potential. The capacity for L. crispatus to synthesize lysine has been previously identi ed as well as its absence in L. iners [66]. An interesting observation was the co-occurrence of the almost complete mevalonate pathway (six enzymes), which produces cholesterol as an end-product, and inerolysin, a cholesterol dependent cytolysin (CDC) in samples containing L. iners. Presence of the mevalonate pathway has been indicated in urogenital bacterial populations and associated to the ability of certain lactic acid bacteria to cope with oxidative stress, which may be an additional adaptation that allows L. iners to colonize hostile environments as well as co-habit with other microbes, especially G. vaginalis [66][67][68]. CDCs such as inerolysin and vaginolysin (G. vaginalis speci c) have been associated to membrane adhesion and subsequent pathogenicity. Pore formation in the host cell membranes, a characteristic of CDCs, has been speculated to allow the microbes to gain access to nutrients and potentially kill immune cells, allowing L. iners to survive in a dysbiotic state [66,69,70]. A repertoire of enzymes involved in the biosynthesis and degradation of inosine monophosphate (IMP) were observed in samples containing G. vaginalis. Even though the enzymes associated with degradation could not be assigned to a single taxonomy, their exclusivity in these samples already shows an interesting trend. Other studies have discussed inosine in various contexts, including as a metabolite from related intestinal Bi dibacterium pseudolongum that modulates an enhanced immunotherapy response, while for some strains of G. vaginalis it has been indicated as a carbon source [71,72]. These ndings warrant further investigations to understand the role of inosine metabolism among certain G. vaginalis strains within the vaginal ecosystem.

Conclusion
The main outcomes of this work are: 1) development and validation of a novel cost-effective method for simultaneous amplicon sequencing for bacteria and fungi in complex samples; 2) development of a novel annotation approach for microbial amplicon sequencing data where the conventional homologybased search is combined with text mining for knowledge-based selection of ecosystem-speci c hits; 3) novel insights into the human vaginal ecosystem through description of the mycobiota as well as the functional attributes of the bacterial microbiome emerging from patterns of enzymatic and other metabolic capacities that are characteristic for the well-established vaginal community types (CSTs), each dominated by distinct prevalent bacteria.
Our sequencing method should also be directly applicable to environmental sciences, agriculture, biotechnology, food technology, and any other eld of science where studying bacterial and fungal populations are of particular interest. Even with its limitations, the customizable nature the text miningbased ltration approach can be adapted to a multitude of ecosystems under investigation.

Declarations
Ethics approval and consent to participate The study was approved by the ethical committee of The Hospital District of Helsinki and Uusimaa and Helsinki region hospital district (21/13/03/03/2014) and performed in accordance with the principles of the Helsinki Declaration. All participants signed an informed consent.

Consent for publication
Not applicable.

Competing Interests
The authors declare no competing interests. Work ow for the text-mining approach. A) An illustration of the annotation work ow implemented within this study. The alignments are performed with BLAST against the nucleotide data, which is restricted with an ID list for Bacteria/Fungi. The accession IDs from the alignment are keys to several NCBI databases used to extract isolation/ecosystem related information for each hit. Word banks that de ne the study speci c host/ecological niche are then used as a ltration and selection criteria, resulting in a nal list of most relevant annotations. B) The accession ID based database restrictions are very exible and can be stacked with different word banks. In our study we separately extract accession ID lists for the organism of interest (i.e. bacteria or fungi) + absence/presence of uncultured entries. The sequence alignment is performed against restricted database prior to stacking them with multiple word bank ltration in a trickle-down approach.