Bias of the Mock Community Composed of Common Strains Isolated from Clinical Specimens

Background: The bias of mock communities (MCs) composed of common species from clinical specimens should be considered when microbiome analysis is used in the clinical microbiology laboratory. The aims of this study were to design MCs using clinically important common species and to investigate the bias of MCs for use in the clinical laboratory. Results: Five MCs incorporating 32 bacterial strains isolated from clinical specimens were included. We analyzed the diversity of operational taxonomic units (OTUs), the relative abundance, and the taxonomic assignment using paired-end sequencing of the 16S rRNA V3−V4 region. Among the diversity indices, the Shannon index showed the best correlation with the actual number of MC species. We determined that an OTU cut-off of relative abundance (log) 0.001 is most appropriate to explain the community. The relative abundance was higher in Bacteroidetes and Fusobacteria; however, it was low in Aeromonas caviae, Burkholderia cepacia, Enterobacter cloacae, Pseudomonas aeruginosa, and Clostridioides difficile. The relative abundance was low in gram-positive bacteria and in bacteria with 60%−70% GC content but was not related to genome size or 16S rRNA copy number. Conclusions: We investigated the characteristics of MCs composed of common clinical isolates and confirmed the bias of the MCs. These data could be used for microbiome research on clinical specimens.


Background
Attempts to use microbiome analysis in a clinical laboratory are increasing with the enhancement of next-generation sequencing (NGS) technology [1][2][3]. However, the method has a few serious limitations for such use, such as a lack of standardization of the research process, distortion of bioinformatics, and inconsistent results [4][5][6].
Recently, the use of a mock community (MC) has been recommended to overcome these limitations and increase the reliability of the results [7,8]. An MC is composed of several bacterial strains and can be used to standardize the experimental process, including DNA extraction, PCR, NGS, and bioinformatics pipelines [7][8][9][10][11]. It also allows data to be compared among researchers and reduces unnecessary duplication of work [12][13][14].
Bias can be reduced by using an MC [12,14], although an MC cannot reproduce all of the microbial community characteristics of a specimen [15]. A few commercial MCs are provided by the American Type Culture Collection, BEI Resources, and Zymo Research, although their use is limited to basic research on specific sites such as oral tissue, skin, gut, and vagina. It is important to consider the GC content, genome size, gram staining result, and 16S rRNA copy number when designing a new MC.
The aims of this study were to design MCs using clinically important common species and to investigate the bias of MCs for use in the clinical laboratory.

Mock community
Five MCs comprising 32 bacteria were constructed to include clinically common and important bacteria (Table 1). All bacterial species were confirmed by biochemical testing, MALDI-TOF MS, and 16S rRNA sequencing.
The criteria for each MC were as follows: genome size (range: 1.8 Mb to 5.8 Mb; mean: 3.9 Mb) in MC1; GC content (range: 27.0%−66.7%; mean: 44.18%) in MC2; 16S rRNA copy number (range: 4−12; mean: 6) in MC3 and MC4; and a total of 32 bacteria with an even distribution of gram staining results in MC5.
We used phosphate-buffered saline as a negative control for DNA library construction.

NGS and bioinformatic pipelines
DNA extraction from the MCs was performed using the Maxwell 16 LEV Blood DNA Kit (Promega, USA). DNA library construction and NGS analysis were performed by paired-end sequencing of the V3−V4 regions according to the manual of the Illumina MiSeq System. The EzBioCloud pipeline (https://www.ezbiocloud.net/contents/16smtp) was used for NGS data analysis. In preprocessing, paired-end reads were merged using the VSEARCH program, and low-quality reads (<Q25) were filtered [16]. Up to 100,000 reads were used for data analysis, and QC processing was performed to exclude low-quality, non-target, and chimeric amplicons. UCHIME with a chimera-free reference database was used to detect chimeras [17]. Operational taxonomic unit (OTU) picking was performed by UCLUST using an open-reference method. Taxonomic assignment was performed through VSEARCH using the EzBioCloud 16S database and was determined on the basis of 97% 16S similarity [16,18].

Diversity of OTU
The species richness component of α-diversity was estimated using the ACE [19], Chao1 [20], and Jackknife [21] methods. Species evenness was calculated using the Shannon and Simpson indices [22].
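As an illustration of how such indices are computed from OTU read counts, the following is a minimal sketch. It assumes the common textbook forms (Shannon with the natural logarithm, Simpson as the sum of squared proportions, and the bias-corrected Chao1); the exact variants implemented in the EzBioCloud pipeline are not stated here, and the read counts are hypothetical.

```python
import math

def shannon_index(counts):
    """Shannon index H = -sum(p_i * ln p_i) over OTUs with nonzero counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def simpson_index(counts):
    """Simpson index D = sum(p_i^2); lower values indicate a more even community."""
    total = sum(counts)
    return sum((c / total) ** 2 for c in counts)

def chao1(counts):
    """Bias-corrected Chao1 richness: S_obs + F1 * (F1 - 1) / (2 * (F2 + 1)),
    where F1 and F2 are the numbers of singleton and doubleton OTUs."""
    s_obs = sum(1 for c in counts if c > 0)
    f1 = sum(1 for c in counts if c == 1)
    f2 = sum(1 for c in counts if c == 2)
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))

# Hypothetical OTU read counts (not data from this study).
even_counts = [250, 250, 250, 250]
skewed_counts = [970, 10, 10, 10]

print(shannon_index(even_counts), shannon_index(skewed_counts))
print(simpson_index(even_counts), simpson_index(skewed_counts))
print(chao1([10, 1, 1, 2]))
```

In this sketch, a perfectly even community of four OTUs gives the maximum Shannon value for four taxa (ln 4), whereas a community dominated by one OTU gives a much lower value.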

Relative abundance and interpretation of taxonomic assignment
We used "fold error" to calculate the bias of relative abundance. Fold error is defined as the ratio of the observed relative abundance to the ideal value, and the ideal value is calculated the same way for each MC (100 divided by the number of species). The EzBioCloud 16S database uses the concept of a "group," which classifies together several bacterial species that cannot be distinguished by their 16S rRNA sequences. Of the 32 species included in this study, 23 were identified as a "group": Klebsiella aerogenes, Klebsiella pneumoniae, and Salmonella enterica were included in the Enterobacteriaceae group (Table 1). The Staphylococcus aureus group included S. aureus and S. epidermidis, and the Streptococcus pneumoniae group included S. pneumoniae and S. pyogenes. For these species, we divided the total relative abundance of the group by the number of species to calculate the fold error. "Others" were defined as misidentified results, that is, species other than the expected MC species.
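The fold error calculation can be sketched as follows. This is a minimal example that assumes fold error is the ratio of the observed relative abundance (in percent) to the ideal value, which is our reading of the definition above; the function names and the numbers in the usage lines are hypothetical and are not results from this study.

```python
def ideal_abundance(n_species):
    """Ideal relative abundance (%) when all species are equally represented."""
    return 100.0 / n_species

def fold_error(observed_pct, n_species, group_size=1):
    """Fold error of one species.

    observed_pct: relative abundance (%) reported for the species, or for
                  the whole "group" when several MC species share one group.
    n_species:    number of species in the MC.
    group_size:   number of MC species in the group (1 if not grouped).
    Assumption: fold error = observed per-species abundance / ideal abundance.
    """
    per_species_pct = observed_pct / group_size
    return per_species_pct / ideal_abundance(n_species)

# Hypothetical MC of 8 species: the ideal abundance is 12.5% per species.
print(fold_error(25.0, 8))                # over-represented species -> 2.0
print(fold_error(12.5, 8, group_size=2))  # group of 2 species -> 0.5 each
```

Under this assumption, a value of 1 means that a species was recovered at its expected proportion, values below 1 indicate under-representation, and values above 1 indicate over-representation.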

Statistics
Correlation was assessed using the Pearson correlation coefficient, and polynomial logistic regression analysis was performed for bacterial variance. The p value was calculated by a two-sided test, and p < 0.05 was considered significant.
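For reference, the Pearson correlation coefficient can be computed as in the following minimal sketch. The function name and input values are hypothetical, and the statistical software actually used for the analysis is not specified here. A two-sided p value would additionally require the t distribution with n − 2 degrees of freedom, which a statistics package such as SciPy (scipy.stats.pearsonr) provides.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical values (not data from this study).
print(pearson_r([1, 2, 3], [1, 3, 2]))  # 0.5
```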

Change of NGS raw data during bioinformatics pipeline
Total reads for all MCs were at least 100,000 each, ranging from 123,255 (MC2) to 184,711 (MC5) (Table 2). After pre-filtering, total reads were between 84,090 (MC4) and 100,000 (MC1, MC3, and MC5). All 5 MCs showed sufficient valid reads of more than 70,000. The percentages of identified reads for MC2 and MC4 were 98.9% and 98.0%, respectively. The identification rates of MC1 (67.4%), MC3 (87.1%), and MC5 (89.6%) were relatively low. However, this was caused by the inclusion of a strain of Sphingobacterium, a suspected new species. When the result for this strain (Sphingobacterium_uc) was treated as correct, the final identification rates of all MCs were more than 98%. The number of identified reads differed among the 5 MCs even though each MC contained the same CFUs of bacteria. The rates of misidentified reads ranged from 0.9% in MC3 to 3.6% in MC4.

Taxonomic assignment
The number of OTUs among the 5 MCs was between 65 (MC4) and 126 (MC5), and the number of species was 48 (MC1)−89 (MC5) (Table 2, Figure 1A). On average, 1.27 (MC2) to 2.10 (MC1) OTUs were matched to one species. The ratio of the number of detected species to that of expected species was between 2.8 (MC4 and MC5) and 5.3 (MC1). The corresponding ratio at the genus level was between 1.8 (MC5) and 4.0 (MC1). These ratios decreased slightly when an MC contained a larger number of species.

Calculation of α-diversity by OTU
The OTU, ACE, Chao1, and Jackknife indices reflecting species richness showed a moderate correlation (r = 0.56−0.62) with the number of expected MC species (Table 2). For species evenness, the correlation between the Shannon index (range: 1.85−2.65) and the number of expected MC species was high (r = 0.82).

Relative abundance at the species level
There were significant differences in the relative abundance, although all expected MC species were identified to the species level (Figures 2 and 3). At the phylum level, the relative abundance was highest in Bacteroidetes and also high in Fusobacteria within each MC. The relative abundance of Fusobacterium was highest (26.42%) in MC2 because MC2 does not contain Bacteroidetes. In this study, Bacteroidetes comprise Bacteroides fragilis (6.8−7.8), Chryseobacterium gleum group
A total of 422 species, including 30 of the species included in the MCs, were identified in the negative control (Figure 2). The relative abundance of these 30 MC species ranged from a minimum of 0.001% (Enterococcus faecalis) to a maximum of 0.58% (Enterobacteriaceae group).

Misidentified results at the species level of 5 MCs
We defined a result as misidentified when the final identification showed a different species in the same genus or a different genus from the expected species. A different species in the same genus was found for most species, and most of these belonged to the Enterobacteriaceae group, Enterococcus, and Streptococcus (Figure 4). However, their relative abundance was low: between 0.46% (MC3) and 1.11% (MC5). Corynebacterium striatum, Clostridioides difficile, Clostridium perfringens, Aeromonas caviae, Haemophilus influenzae, and Stenotrophomonas maltophilia were identified at the species level. Pantoea, Erwinia, Cronobacter, Cosenzaea, and Raoultella were identified although they were not included in the MCs (Table 3).

Difference of relative abundance by bacterial characteristics
The fold error was significantly lower in gram-positive bacteria (Figure 3). There was no difference in fold error by 16S rRNA copy number or genome size. Bacteria with GC contents of 60% to 70% showed significantly lower fold error.

Discussion
For the diagnostic use of the microbiome in the clinical microbiology laboratory, it is necessary to verify the experimental procedures and data analysis using bacteria isolated from clinical specimens [23,24]. In this study, we tried to confirm the bias of microbiome analysis using 5 MCs incorporating 32 clinical isolates.
It is well known that at least 60,000 raw reads are necessary to analyze a microbiome using biological samples. It has been reported that more than 100,000 raw reads are essential for reliable community analysis [25]. In this study, we obtained more than 100,000 raw reads with more than 70,000 valid reads. We believe that 60,000 valid reads are enough to analyze the microbial community because the number of OTUs did not change and the rarefaction threshold was reached at 60,000 reads in all MCs.
The OTU to species ratio was 1.3−2.1. This is attributable to sequencing errors being counted as separate OTUs [26]. We could identify all species included in the 5 MCs when analyzing OTUs with a relative abundance (log) of 0.001 or higher. We believe that applying the relative abundance (log) cut-off of 0.001 can eliminate sequencing errors and allows the community analysis to be performed accurately.
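Applying such a cut-off can be sketched as follows. This is a minimal example under explicit assumptions: the threshold is interpreted as a relative abundance of 0.001 expressed in percent, which is not specified unambiguously above, and the function name, OTU labels, and read counts are hypothetical.

```python
def filter_by_relative_abundance(otu_counts, min_pct=0.001):
    """Keep OTUs whose relative abundance (%) is at least min_pct.

    otu_counts: dict mapping an OTU label to its read count.
    min_pct:    cut-off in percent (assumed unit; adjust if the cut-off
                is defined on a different scale).
    Returns a dict mapping each retained OTU to its relative abundance (%).
    """
    total = sum(otu_counts.values())
    if total == 0:
        return {}
    kept = {}
    for otu, count in otu_counts.items():
        pct = 100.0 * count / total
        if pct >= min_pct:
            kept[otu] = pct
    return kept

# Hypothetical counts: otu_c has 5 of 1,000,000 reads (0.0005%) and is dropped.
counts = {"otu_a": 900_000, "otu_b": 99_995, "otu_c": 5}
print(sorted(filter_by_relative_abundance(counts)))  # ['otu_a', 'otu_b']
```

Keeping the cut-off as a parameter makes it straightforward to check how the number of retained OTUs changes with the threshold.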
Species diversity can be expressed as species richness and evenness. The richness indices (ACE, Chao1, and Jackknife) differed from the actual number of species in the MCs. When we calculated the Shannon and Simpson indices for species evenness, the former was highly correlated with the actual number of species in the MC. We therefore concluded that the Shannon index is the best indicator of species diversity for an MC, as in a previous report [27].
For many bacteria, a partial sequence (V3−V4 region) of the 16S rRNA gene is identical or very similar, so it is common for one strain to be reported as two species, although the opposite can also occur. The EzBioCloud 16S database has the advantage of reducing such erroneous results through the use of a "group," which comprises several species that cannot be distinguished by 16S rRNA amplicons. In this study, 23 of 32 species were reported as a group in this database. With the EzBioCloud 16S database and its concept of "group," it is possible to analyze microbiome data to the species level, whereas many previous reports allowed identification only to the genus level.
The relative abundance differed among species even though the same amount of each bacterium was included in each MC. We confirmed that the relative abundance was extremely low in P. aeruginosa, E. cloacae, A. caviae, C. difficile, B. cepacia, and S. aureus (Figure 2). Therefore, it should be noted that these species are likely to be markedly underestimated when clinical specimens are used for microbiome analysis. In addition, these species should be considered when determining the cut-off value of the community.
It has been reported that relative abundance differs according to phylum, bacterial cell wall, GC content, genome size, and 16S rRNA copy number [28,29]. We also confirmed that the phyla Bacteroidetes (B. fragilis, Sphingobacterium sp., C. indologenes) and Fusobacteria (F. nucleatum) showed high relative abundance. In addition, the relative abundance of gram-positive bacteria and of bacteria with GC contents of 60% to 70% was low. This bias can occur during DNA extraction and PCR amplification for microbiome analysis using NGS [8,29].

Conclusion
We investigated the characteristics of MCs composed of common clinical isolates and confirmed the bias of the MCs. These data could be used for microbiome research on clinical specimens.

Declarations
Ethics approval and consent to participate
Not applicable.

Consent for publication
Not applicable.

Availability of data and material
All the assemblies were deposited into the NCBI SRA database with the following accession number: PRJNA801011.

Competing interests
The authors declare that they have no competing interests.

Table 1. Composition and characteristics of mock communities

Table footnotes:
*, After pre-filtering, no more than 100,000 reads were used for analysis.
†, Number of reads identified at the species level.
‡, Ratio of each read divided by valid reads.
§, Number of reads misidentified at the species level.
¶, Correlation coefficient with number of MC species.
**, Contains species indistinguishable from the 16S rRNA V3−V4 region.
††, Identified bacteria as species similar to MC species.