In-Silico Evaluation and Selection of the Best 16S rRNA Gene Primers for Use in Next-Generation Sequencing to Detect Oral Bacteria and Archaea.


 Background: Sequencing has been widely used to study the composition of the oral microbiome present in various health conditions. The extent of the coverage of the 16S rRNA gene primers employed for this purpose has not, however, been evaluated in silico using oral-specific databases. This paper analyses these primers using two databases containing 16S rRNA sequences from bacteria and archaea found in the human mouth and describes some of the best primers for each domain. Results: A total of 369 distinct individual primers were identified from sequencing studies of the oral microbiome and other ecosystems. These were evaluated against a database reported in the literature of 16S rRNA sequences obtained from oral bacteria, which was modified by our group, and a self-created oral-archaea database . Both databases contained the genomic variants detected for each included species. Primers were evaluated at the variant and species levels, and those with a species coverage (SC) ≥75.00% were selected for the pair analyses. All possible combinations of the forward and reverse primers were identified, with the resulting 4638 primer pairs also evaluated using the two databases. The best bacteria-specific pairs targeted the 3-4, 4-7 and 3-7 16S rRNA gene regions, with SC levels of 97.14-98.83%; meanwhile, the optimum archaea-specific primer pairs amplified regions 5-6, 3-5 and 3-6, with SC estimates of 95.88%. Finally, the best pairs for detecting both domains targeted regions 4-5, 3-5 and 5-9, and produced SC values of 94.54-95.71% and 96.91-99.48% for bacteria and archaea, respectively. Conclusions: Given the three amplicon length categories (100-300, 301-600 and >600 bps), the primer pairs with the best coverage values for detecting oral bacteria were: KP_F048-OP_R043 (region 3-4; primer pair position for Escherichia coli J01859.1: 342-529), KP_F051-OP_R030 (4-7; 514-1079), and KP_F048-OP_R030 (3-7; 342-1079). For detecting oral archaea, these were: OP_F066-KP_R013 (5-6; 784-undefined), KP_F020-KP_R013 (3-6; 518-undefined) and OP_F114-KP_R013 (3-6; 340-undefined). Lastly, for detecting both domains jointly they were KP_F020-KP_R032 (4-5; 518-801), OP_F114-KP_R031 (3-5; 340-801) and OP_F066-OP_R121 (5-9; 784-1405). The primer pairs with the best coverage identified herein are not among those described most widely in the oral microbiome literature.

with the consensus sequence (9). As a consequence, using an inadequate primer can lead 125 to questionable biological conclusions (9). 126 If PCR results in microbial research are to be interpreted satisfactorily, conducting a 127 comprehensive evaluation of a primer's coverage is essential (10). The concept of 128 coverage has been defined heterogeneously as: the percentage of matches for certain 129 taxonomic ranks (11,12); the number of sequences matched by at least one primer (13); 130 or the proportion of species-level taxonomic entries for each phylum in a database where 131 the prediction is that these will be amplified using a particular primer pair ( genes primers used to detect and amplify bacteria and archaea in oral samples before 158 massive sequencing; and 2) making a list of the archaeal species reported to be inhabitants 159 of the human mouth to create a database of oral-archaea 16S rRNA gene sequences. The 160 groups of words employed in these searches can be found in additional file 1. 161 Text-mining techniques were applied to all the downloaded abstracts using the R package 162 "tm" (version 0.7-7) (27). Specifically, the abstracts were tokenised, which involved the 163 classification of the words and groups of two or three words contained within them. For 164 purpose 1, publications on the study of bacterial microbiota received a score if their 165 abstracts included terms associated with the oral cavity, and another if they contained 166 terms related to a 16S rRNA gene and its different regions. A further score based on 167 archaea-associated words was used for the articles about the oral archaeome. For purpose 168 2, the studies identified in the searches seeking to uncover oral archaeal species were 169 rated and assigned an oral and an archaeal word score. The terms used to calculate the 170 scores were the same as those used in the searches. Repeated words were counted only 171 once, meaning that articles with higher scores were purely those with a greater diversity 172 of words. Ultimately, we were left with 129 bacterial and 16 archaeal studies involved 173 the use of at least one different 16S rRNA gene primer, and 53 articles containing 174 information on archaeal species (Figure 1). The references of all these papers are included 175 in Additional files 2 and 3. 176 177 Figure 1. Flowchart on the computational search of articles in PubMed and their analysis 178 using text-mining techniques. 179 180 Papers from "purpose 1" received one score for the oral cavity-words included in their abstracts and another 181 for the terms associated with the 16S rRNA gene and its different regions; papers from "purpose 2" received 182 an oral-and an archaeal-word score. In each score, for each different related term included in the abstract, 183 we gave one point with repeated words only counted once (i.e.: in a given abstract, the words "oral", 184 "mouth" and "periodontitis" appear two, one and three times so the oral cavity score is equal to three). The 185 terms used to give the punctuations were those used to conduct the searches. *Additional publications on 186 the study of the oral microbiota using sequencing were considered for full-text reading; these were 187 previously reviewed for other reasons (n= 15) or were found during the search for the oral archaea species 188 (n= 12).

190
Primer selection and creating a list of archaeal species found in the oral cavity 191 In total, we identified 444 16S rRNA gene primers: 204 forward (F), 230 reverse (R) and 192 12 unidentified (UI). Two hundred and seventy-eight of the primers were procured from 193 the searches on PubMed, being 238 and 37 used for the detection of oral bacteria or 194 archaea, respectively, and three to identify bacteria in the respiratory ecosystem. The 195 remaining 166 were extracted from articles concerning different niches, mainly 196 ecological, described in Klindworth et al. (11). Of them, 103 corresponded to the bacteria 197 domain, 42 to the archaea domain, and 21 were universal. All 444 primers were assigned 198 a unique identifier based on where they were sourced -"OP" for oral primer and "KP" for 199 Klindworth primer (11)-and their direction (F, R or UI), followed by a three-digit number 200 (Additional file 4). The 5'-3' sequences of all 444 primers were then compared to identify 201 repeats, with 75 identified as having the same sequences (Additional file 4

253
Creation of a 16S rRNA gene-sequence database of archaea 254 We searched the NCBI nucleotide database for the complete genomes of the archaeal 255 species found in the human mouth. Along with a script developed in Python, these 256 identifiers enabled us to download 193 genomes from RefSeq and eight from GenBank. 257 The script was completed using a free downloadable module, search_16S.py (42), which 258 is based on the algorithm created by Edgar (43). This allowed us to detect and extract the 259 The individual primers with a SC ≥75.00% were chosen in this stage of the research and 298 all the possible combinations between F and R were identified. We then estimated the 299 mean length between the two positions using the mean position of the first nucleotide of 300 the F primer and that of the last nucleotide of the R primer. The primer pairs had to fulfil 301 two conditions: 1) the mean position of the F primer's first nucleotide had to be lower 302 than that of the R primer's last; and 2) the minimum distance between the two means had 303 to be ≥100 nucleotides. The calculated average length was used to classify the primer 304 pairs into one of three categories relating to the mean amplicon lengths: 1) 100 to 300 305 nucleotides; 2) 301 to 600; and 3) more than 600. 306 Primer pairs obtained were evaluated against both the bacteria and archaea databases. 307 This step enabled us to determine whether a primer pair was bacteria-specific, archaea-308 specific, or suitable for both domains. 309

311
Of the 369 individual primers, 178 (103 F, 75 R) and 50 (33 F, 17 R) showed some 312 coverage value for only bacteria or only archaea, respectively. One hundred and twenty-313 four (30 F, 94 R) were able to detect both oral bacterial and archaeal species, while 17 (9 314 F, 8 R) were not able to detect any such organisms. 315 The metrics obtained using the two databases for individual primers as well as all the Bacterial-specific primer pairs 327 The pair's analysis revealed that 3218 of the 3837 bacterial-specific primer-pair 328 candidates had an archaeal VC and SC of 0.00%. On the other hand, 619 had some 329 coverage value for oral archaea (archaea VC range= 52.25% -0.04%; archaea SC range= 330 70.62% -0.52%). Relative to the mean lengths of the generated amplicons: 840 primer 331 combinations had bps of 100 to 300; 1374 from 301 to 600; and 1623 more than 600. 332 In the first length category, 139 pairs had bacterial SC values ≥ 95.00% (bacterial SC 333 range= 99.09% -95.19%), while 33 also had an archaeal SC of 0.00%. The latter were 334 used to amplify gene regions 3-4 or 5-7 and had bacterial SC values ranging from 97.92% 335 to 95.58%, which meant that 16 to 34 oral bacterial species were not covered. For most 336 of these, the mean read length of their amplicons was around 186 (range= 182 -189). 337 However, the pair OP_F009-OP_R030 from region 5-7 stood out, with a mean read length 338 of 297 and a bacterial SC value of 96.88%, with only 24 bacterial species from the oral 339 cavity were not covered by this pairing. 340 Sixty-eight primer pairs in the second amplicon length category had bacterial SC values 341 ≥ 95.00% (range= 98.83% -95.06%). Of these, 45 did not amplify any archaeal species 342 and could therefore be treated as bacterial-specific. Their bacterial SC values also ranged 343 from 98.83% to 95.06%, meaning that between nine and 38 species were not covered. In 344 addition, these pairs targeted gene regions 3-5, 3-6 or 4-7, and had maximum (max. For each amplicon-length category, we selected at least one primer pair suitable for 365 detecting only bacteria (archaeal SC= 0.00%) and which targeted distinct 16S rRNA gene 366 regions ( Table 1  Additional file 16 includes the oral species not detected by the primer pairs that were 386 found to achieve a bacterial SC ≥ 95.00% and an archaeal SC= 0.00%, as well as those 387 named previously or included in   The max. mean length was 288 bps and the min. 284. All the pairs targeted gene region 455 4-5 and had been assigned the following identifiers: KP_F020-KP_R031; KP_F020-456 OP_R070; KP_F020-KP_R032; KP_F020-KP_R035; KP_F020-OP_R020; KP_F020-457 KP_R038; KP_F020-OP_R010; KP_F020-OP_R014; KP_F020-OP_R036; and 458 KP_F020-OP_R048. The number of bacterial species that were not covered by these pairs 459 ranged from 31 to 36; for the oral archaeal species, this range was one to four. respectively, which was close to the lower limit of this category. Primer pairs formed by 480 OP_F114 with KP_R060, KP_R076 or OP_R121, which targeted region 3-9, had a better 481 balance between the coverage results (bacteria SC= 91.42%, archaea SC= 96.91%) and 482 the mean sequence lengths (1063, 1062 and 1062 bps). Sixty-six bacteria and six archaea 483 and were not covered by these pairs. 484 For each amplicon-length category, we selected at least one primer pair suitable for 485 detecting bacteria and archaea in the distinct 16S rRNA gene regions (  To the best of our knowledge, this is the first study that evaluates in silico the coverage 511 of 16S rRNA gene primers for the detection of oral bacterial and archaeal species. The 512 primer sequences were obtained not only from sequencing-based studies of the 513 microbiota inhabiting the human mouth, but also from an article containing primers used 514 in ecosystems as dissimilar as the marine, geothermal, human gut, or cattle gut (11). Thus, 515 numerous primers from diverse ecosystems were analysed to find those which performed 516 better in the oral cavity. Moreover, to perform the analysis, we improved an earlier 517 database of 16S rRNA gene sequences of oral bacteria (28) and created another from 518 scratch that contained sequences from archaeal species found in the oral cavity. 519 We identified a series of individual primers that performed well in the detection of oral 520 bacteria and/or archaea and combined them to create primer pairs. These were defined as 521 "bacteria-specific", "archaea-specific", or "bacterial and archaeal" based on the results 522 on their levels of coverage set out in the two databases. We also produced a series of 523 primer pairs that may be the most suitable combinations for use when sequencing the oral 524 ecosystem. These were classified according to the domain targeted, their mean amplicon-525 length category, and the 16S rRNA gene region amplified. 526 Like us, this group organised the most suitable primer combinations for the different 533 sequencing technologies into three categories according to their amplicon length (100-534 400, 400-1000, >1000 bp). They then re-evaluated their analysis using the Global Ocean 535 Sampling (GOS) dataset (51,52), which is limited to the marine habitat and examined 536 experimentally the primer pair that performed best (11). 537 We identified two investigations involving the oral ecosystem that used the Silva database 538 -versions 111 (14) or 132 (17)-to analyse the efficiency of 16S rRNA gene primers for 539 detecting the archaea diversity in oral samples (17), or for reconstructing the microbiome 540 of ancient dental calculus specimens (14). Also, a third study evaluated the potential of 541 seven primer pairs for detecting 219 species in a foregut dataset they created, which 542 included oral, oesophageal, and gastric 16S rRNA gene sequences (18). The pair with the 543 best results for classifying the foregut genes was also analysed against the RDP database. 544 The numbers of 16S rRNA gene primer pairs evaluated in these three oral-related studies 545 are substantially lower than in the present investigation: respectively, 12 individual 546 primers combined to form 12 primer pairs (17) were oral and, overall, they represented just 219 bacterial species. These numbers are 562 much lower than those in the bacteria database used in our study, which is based on 563 eHOMD, and to which we added sequences from our self-created archaeal dataset 564 (bacteria: 223143 sequences, 769 species; archaea: 2842 sequences, 194 species). 565 Bacteria-specific primer pairs 566 Table 4  The remaining candidates to be bacteria-specific primer pairs in all the amplicon-length 584 categories herein were better at detecting bacteria than in other studies, with differences 585 of 5.00% to 43.00%. Only Klindworth's estimate for the bacterial coverage of KP_F033-586 KP_R050 was better than ours, with an approximate difference of ~9.00% between the 587 studies (Table 4). 588 Archaea-specific primer pairs 589 Two of the primer pairs selected herein to detect oral archaea species -KP_F018-590 KP_R002 (100-300 bp) and KP_F020-KP_R013 (301-600 bp) -have previously been 591 described in other studies as the best options for targeting this domain (11). Nonetheless, 592 the archaeal SC values we obtained exceed Klindworth's by ~ 20.00% (Table 4). 593 Klindworth's group (11) also found that KP_F018-KP_R078 had the highest overall 594 archaeal coverage for the amplicons with a long mean length. However, they do not 595 recommend its use due to its low phylum spectrum. This combination produced bacterial 596 SC values of 0.00% and an archaeal SC of 93.30% when analysed against our database. 597 Although this is a good result, we prefer OP_F114-KP_R013, which achieves better 598 archaeal SC, or KP_F018-KP_R063, which produces both better archaeal SC and a 599 greater mean amplicon length (Table 4). 600 domains when employing medium mean amplicon lengths (608 bps). However, its length 614 was 622 bps in our study, meaning that it was included in the >600 bps group. In any 615 case, although our coverage values were higher than those obtained previously (SC= 616 93.63% and 96.39% vs 74.10 and 72.30%, respectively) (Table 4), other primer pairs 617 performed better in both categories as OP_F114-KP_R031 (301-600; bacterial SC= 618 95.71% and archaeal SC= 98.45%) and OP_F066-OP_R121 (>600; bacterial SC= 619 94.54% and archaeal SC= 96.91%). 620

Bacteria and archaea primer pairs
Non-covered species by the 16S rRNA gene primer pairs 621 The in-silico analysis has enabled us to verify that, among the pairs achieving better 622 coverage, the species not covered by the primers targeting a particular region tend to be 623 covered by others relating to a different zone. In this sense, most of the species that were 624 not covered by the bacterial-specific primer pairs from regions 3-4 (100-300 bps), 3-5 625 (301-600 bps), or 3-7 (>600 bps) were by those from 5-7 and 6-7 (100-300 bps), 4-7 and 626 7-9 (301-600 bps), or 4-9 (>600 bps), and vice versa (Additional file 16). This was also 627 seen in the archaeal-specific primers, where taxa not detected by the pairs from regions 3 628 (100-300 bps), 3-5 (301-600 bps) or 3-6 (>600 bps) were covered by those from 5-6 (100-629 300 bps), 3-6 and 5-9 (301-600 bps), or 3-9 and 5-9 (>600 bps), and vice versa (Additional 630 file 17). Lastly, the pairs for the two domains combined also demonstrated that species 631 not covered by primers from zones 4-5 (100-300 bps) or 3-5 (301-600 bps) were by those 632 from 5-6 (100-300 bps) or 4-6 (301-600 bps), and vice versa (Additional file 18). In the 633 combinations with mean amplicon lengths >600 bps, half the taxa that were not covered 634 using primers for amplifying region 3-9 were detected when targeting 5-9. However, in 635 this case, the opposite was not true. 636 There were exceptions to this general rule, which demonstrated that even for two primer 637 pairs targeting the same gene region, one would be able to cover most of the species that 638 were not detected by the other. As an example, the bacterial-specific pair OP_F101-639 OP_R030 covered 33 of the 66 species not detected by KP_F061-KP_R074 (6-7; 100-640 300 bps). Furthermore, several primers from region 4-7 (301-600 bps) covered more than 641 half of the bacterial species not detected by KP_F051-KP_R053 and OP_F021-KP_R053 642 (Additional file 16). Meanwhile, the archaeal-specific primer KP_F016-KP_R002 643 covered almost all of the taxa not detected using KP_F018-KP_R002, KP_F018-644 KP_R003 and KP_F018-OP_R010 (3; 100-300 bps). In addition, KP_F016-KP_R032 645 was the only primer from region 3-5 (301-600 bps) able to identify six archaea that were 646 not detected by the rest of the primers from the same region (Additional file 17). 647 It is clear that most of the taxa not detected by these well-performing primer pairs must 648 have been identified as present in the oral cavity at some point, or they would not have 649 been included in the databases used for our in-silico analysis. However, some of them are 650 microbes associated with prevalent oral pathologies, such as periodontal disease or dental 651 caries. We distinguished four recognised Campylobacter species among the bacteria not 652 detected by the bacterial-specific or bacteria and archaea primer pairs: concisus, gracilis, 653 rectus and showae. The first of these, as part of the Socransky green complex, has 654 traditionally been associated with periodontal health; the remaining three are components 655 of the orange complex, which is related to periodontitis (53). A further three bacteria 656 commonly found in the healthy periodontium, Leptotrichia bucalis (54), Leptotrichia 657 hofstadii (54) and Rothia dentocariosa (55,56), were also missed by some of the primer 658 pairs. Conversely, a few failed to cover bacterial taxa isolated from periodontally-659 diseased sites (in teeth or implants) or those regarded as novel periodontal pathogens, 660 e.g., Actinomyces dentalis (57), Actinomyces israelii (57), Desulfomicrobium orale (58), 661 Mogibacterium timidum (59), Solobacterium moorei (60,61), Treponema 662 lecithinolyticum (55,59,61) and Treponema maltophilum (62). A further Actinomyces 663 species, previously classified as naeslundii WVA 963 and now known as johnsonii (63), 664 which has been encountered in both healthy and periodontitis sites (57), was not detected 665 by some of the pairs that produced better coverage estimates. Moreover, different taxa 666 from the phyla Saccharibacteria (TM7), which growing evidence links to periodontal 667 disease (64), were also not covered. Meanwhile, the caries-associated bacterial species 668 that were not detected by some of the primer pairs included Bifidobacterium dentium 669 Although we defined which primer pairs had the highest coverage results for detecting 740 oral bacteria and archaea, this does not necessarily mean that they would be always the 741 best option for any sequencing-based research on the oral microbiome. Other factors such 742 as the amplicon length or gene region targeted, should also be taken into account when 743 selecting the optimum primer pair as we did for constructing tables 1, 2 and 3. Although 744 PCR efficiency decreases when the amplicon length increases (84), in general terms, the 745 longer the fragment sequenced, the lower the taxonomic level that can be achieved (12). Consequently, we agree that the choice of the target region is also an important factor 754 (12,24). 755

Limitations of the present study 756
The main limitation of the present study arises from the lack of information on the first 757 and last positions in the sequence annotations stored by the NCBI, which suggests that 758 primers targeting these gene regions may have lower VC values. We, therefore, calculated 759 the SC estimates, since a particular species would be regarded as covered by a particular 760 primer if at least one of its variants is amplified. In addition, we were unable to identify 761 the complete genome of some archaeal species in our database (Candidatus Korarchaeum 762 cryptofilum, Methanobrevibacter gottschalkii strain HO, Methanobrevibacter oralis, 763 Methanobrevibacter thaueri strain CW and Nitrosoarchaeum limnia). Given that the gene 764 sequences from these taxa were not obtained in the same way as for the other species, we 765 cannot be sure that there are no sequence variants in addition to those found in our 766 investigation. It should, however, be noted that the oral archaea database developed by 767 our group is the first proposal and may, therefore, be subject to change. Moreover, 768 additional scientific evidence on the archaea species associated with the oral cavity and 769 its diseases is required to increase the amount of information contained in the 16S rRNA 770 sequence databases. The results of our in-silico analysis should also be confirmed in 771 further experimental studies using omics techniques. 772

CONCLUSIONS 773
Considering the three amplicon category lengths (100-300, 301-600 and >600), the 774 primer pairs with the best estimated coverage for detecting oral bacteria targeted regions 775 3-4, 4-7 and 3-7, and these were: KP_F048-OP_R043 (primer pair position for 776 Legend Figure 1. Papers from "purpose 1" received one score for the oral cavity-words included in their 852 abstracts and another for the terms associated with the 16S rRNA gene and its different regions; papers 853 from "purpose 2" received an oral-and an archaeal-word score. In each score, for each different related 854 term included in the abstract, we gave one point with repeated words only counted once (i.e.: in a given 855 abstract, the words "oral", "mouth" and "periodontitis" appear two, one and three times so the oral cavity 856 score is equal to three). The terms used to give the punctuations were those used to conduct the searches. 857 *Additional publications on the study of the oral microbiota using sequencing were considered for full-text 858 reading; these were previously reviewed for other reasons (n= 15) or were found during the search for the 859 oral archaea species (n= 12).