Development and Multicenter Assessment of a Reference Panel for Clinical Shotgun Metagenomics for Pathogen Detection

Clinical shotgun metagenomics for used in However, its technical limited to individual single experimental Here we report the design and development of a set of reference reagents and reporting metrics dedicated to clinical metagenomics by the National Institutes for Food and Drug Control (NIFDC) of China, and a joint evaluation study including 17 independent laboratories in cross-workflow and cross-platform settings. Our results showed that the performance of metagenomic assays was significantly impacted by microbial type, host context, and read depth, highlighting the importance of taking these factors into consideration when designing reference reagents and benchmarking assays. Through this study, we found false positives to be a common challenge across centers, and considerable site and library effects to limit the assay’s quantitative value. Our multicenter study also provided practical guidance on performance: laboratory-developed shotgun pathogen metagenomics tests should aim to detect microbes at 500 CFU/mL (or copies/mL) in a clinically relevant host context (10^5 human cells/mL) within a 24h turnaround time, and with a read depth of 20M reads or lower. This collaborative work provided a unique resource comprising nearly 600 billion reads (>5 Tb) for technical evaluation in clinical and regulatory settings. We demonstrated that the performance of metagenomic assays was significantly impacted by the microbial type, the host context and read depth, which emphasizes the importance of considering these factors when designing reference reagents and benchmarking studies. Across sites, workflows and platforms, false positive reporting and considerable site/library effects were common challenges to the assay’s accuracy and quantifiability.
Our study also suggested practical guidance on performance: laboratory-developed shotgun pathogen metagenomics tests should aim to detect microbes at 500 CFU/mL (or copies/mL) in a clinically relevant host context (10^5 human cells/mL) within a 24h turnaround time, and with a read depth of 20M reads or lower. This collaborative work provided a unique resource comprising nearly 600 billion reads (>5 Tb) for technical evaluation in clinical and regulatory settings.


Introduction
Infectious diseases are a leading cause of death worldwide, attributable to a great variety of pathogens that belong to different microbial types. Rapid and precise identification of disease-causing pathogens is the key to effective clinical management but remains challenging in clinical settings [1,2]. Conventional diagnostics either rely on culturing or require a presumptive diagnosis by the clinician prior to testing. Recent advances in high-throughput sequencing and bioinformatics technologies have enabled rapid growth in the application of metagenomic testing to detect pathogens [3][4][5][6][7]. Importantly, the rapid identification of SARS-CoV-2, the causative agent of the COVID-19 pandemic, was highly attributable to the use of pathogen metagenomics assays [8][9][10][11][12].
Next-generation sequencing (NGS)-based assays have recently been widely applied in the fields of noninvasive prenatal testing and companion diagnostics for cancer treatment [13][14][15]. However, compared to these assays, which analyze a limited number of genetic sites within the human genome, pathogen shotgun metagenomics faces unique challenges, as it involves a great variety of genomes from all organisms present in clinical samples [16][17][18][19]. The wide ranges of cellular and genomic characteristics of these organisms require that the assay not only access all genetic contents (e.g. by breaking all cellular structures), but also differentiate them (e.g. by preventing false annotation of closely related species). So far, assessments of shotgun metagenomics for pathogen detection have been limited to reference standards from individual labs and to a single experimental workflow and laboratory. A multicenter evaluation study using a common set of purpose-designed reference reagents and performance metrics is hence highly desirable; it is crucial for establishing performance standards, guiding proper result interpretation, further assay development and clinical adaptation, and for providing valuable information from the regulatory perspective for such a newly emerging technology. Comparable large-scale community efforts, the MicroArray Quality Control (MAQC) and Sequencing Quality Control (SEQC) projects, were coordinated to assess the performance of microarray and RNA-seq technologies across laboratories, platforms and pipelines [20][21][22][23].
In this study, we describe the development of a pathogen reference panel and reporting metrics dedicated to clinical shotgun metagenomics by the National Institutes for Food and Drug Control (NIFDC) of China, which produces the majority of the country's official reference reagents, and a joint evaluation study including 17 independent laboratories in cross-workflow and cross-platform settings. In total, over 580 billion reads and over 5 Tb of sequencing data were generated and studied. To our knowledge, the current study represents the largest effort to date to produce and analyze comprehensive reference datasets for pathogen metagenomics.

Design of Pathogen Reference Panel, Reporting Metrics and Study Overview
To mimic the biological context of clinical specimens and enable comprehensive assessments, we designed and constructed a panel of 9 pathogen reference (PR) reagents that covered 30 potentially pathogenic microorganisms of 5 different types (gram-negative and gram-positive bacteria, fungi, and DNA and RNA viruses) and included 2x10^5/mL human cells as host background (table S1) [24,25]. These 30 species comprised 19 genera, with some species intentionally chosen from the same genus to test the ability of the assays to discriminate closely related microbes. The panel was also designed to include microorganisms with a wide range of genome sizes (from 0.7 Kb to 19.05 Mb) and GC contents (from 33.2% to 70.4%, Fig. 1A, table S2).
Among this reference panel of 9 PR samples, one served as control (pathogen reference control or PRC) and had no contrived microbes. The other 8 samples can be grouped into two sets (PRH1-PRH4 and PRL1-PRL4) or four pairs (e.g. PRH1 and PRL1). Each pair of PR samples comprised the same contrived microorganisms at two different titers: the sample in the PRH group had microbes contrived at a 5-fold higher titer than its PRL counterpart (Fig. 1B). For instance, PRH1 and PRL1 both contained the same microorganisms (Escherichia coli K1, Streptococcus pneumoniae, Cryptococcus neoformans, Echovirus 11, Herpes simplex virus 1, Human betaherpesvirus 5, Human herpesvirus 6B), but each microorganism in PRH1 was 5-fold greater in titer than in PRL1. Every reference reagent in the PR panel was verified by polymerase chain reaction (PCR)-based methods and distributed to 17 independent laboratory sites (centers C1-C17) for blinded metagenomic testing and bioinformatics analysis (Fig. 1C, Supplemental Methods). These laboratories employed various experimental procedures, bioinformatics pipelines, and sequencing platforms, which included four different sample preprocessing steps, two different nucleic acid extraction approaches, three different library preparation approaches, six different sequencing platforms, and four different bioinformatics methods.
To support quantitative assessments, the PR panel was tested undiluted and in 10 replicates at 1:10 dilution, except for the pathogen-free PRC sample. A total of 2,641 libraries were sequenced (Fig. 1C, table S3, table S4), generating 587 billion reads and 5.51 TB of data. After blinded testing, each site was requested to report a list of microbes along with their mapped reads, as well as the corresponding raw sequencing data, for further independent meta-data analyses. Given the unique, agnostic nature of this assay, we assessed the results using performance metrics including Recall, Precision, and F-score to indicate the assay sensitivity, specificity, and overall accuracy (Fig. 1C).
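As an illustration of these reporting metrics, the following sketch scores one site's reported microbe list against the known composition of a sample. It assumes that the F-score is the balanced F1 (the harmonic mean of Precision and Recall), which the text does not state explicitly; the function name `score_report` and the example species lists are hypothetical and are not taken from any site's data.

```python
def score_report(expected: set[str], reported: set[str]) -> dict[str, float]:
    """Score a reported microbe list against the known sample composition."""
    tp = len(expected & reported)   # contrived microbes that were reported
    fp = len(reported - expected)   # reported microbes that are not in the sample
    fn = len(expected - reported)   # contrived microbes that were missed
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Assumption: the F-score is computed as F1.
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return {"TP": tp, "FP": fp, "FN": fn,
            "recall": recall, "precision": precision, "f_score": f_score}


# Hypothetical example: one microbe missed, one background microbe reported.
expected = {"Escherichia coli", "Streptococcus pneumoniae",
            "Cryptococcus neoformans", "Echovirus 11"}
reported = {"Escherichia coli", "Streptococcus pneumoniae",
            "Cryptococcus neoformans", "Cutibacterium acnes"}
result = score_report(expected, reported)
```

Scoring on sets rather than read counts reflects that each site reported a list of detected microbes; a count-weighted variant would need a different definition.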

Multicenter Evaluation of Clinical Metagenomics Using The PR Panel
We first evaluated the diagnostic performance across the 17 sites. F-scores varied considerably, ranging from 0.5 to 1.0 with an average of 0.81. Even though only 2 of the 17 sites achieved an overall F-score of 1.0, 59% (10 sites) achieved an F-score of >0.83 (Fig. 2A, B). At the sample level, nearly 40% of samples reached F-scores of over 0.9, nearly 70% were over 0.75, and only 4% were lower than 0.5 (fig. S1A). Visualization of the similarity in detected microbes demonstrated that results were clearly grouped by reference sample, despite variances across sites (Fig. 2C).
Recall and Precision contributed differentially to the site-to-site variation in F-score (Fig. 2A, fig. S1B). While Recall levels remained relatively consistent (average=0.88, range: 0.75-1.0), Precision varied significantly across sites (average=0.77, range: 0.45-1.0). Similar observations were made at the sample level (fig. S1A). To further dissect the cross-site variation in diagnostic performance, we analyzed the TP, FP, and FN results at each site, and found FP to be the most variable across sites, ranging from 0 to 35 counts per site (fig. S1C), while TP and FN appeared to be relatively consistent, ranging from 21-30 and 0-9 counts, respectively. These results suggest that the overall assay performance across workflows and sites was differentiated more by the ability to reduce false positives than by the ability to reduce false negatives. Intriguingly, despite measuring two different aspects of assay performance, a significant positive correlation rather than a trade-off was observed between Recall and Precision (P=0.013, fig. S1D).
Among different microbial types, RNA viruses appeared to be the most challenging to detect, with an average Recall of only 0.71 across all sites, significantly lower than that of other pathogens. Gram-positive and gram-negative bacteria had the highest Recall among all microbial types (0.96 and 0.94), followed by DNA viruses and fungi at 0.89 and 0.80, respectively (Fig. 2D). Similar Recall patterns were observed between PRH and PRL panels (fig. S2A). Most microorganisms at titers above 200 CFU/mL or copies/mL could be detected by >50% of the sites, although some RNA viruses were missed even above 100,000 copies/mL (Fig. 2E). Among all the microorganisms in our panel, fungi and RNA viruses including Echovirus 11, Human respiratory syncytial virus B, Human parechovirus, Candida albicans, and Candida lusitaniae were the most prevalent causes of false-negative results (fig. S2B). In line with these findings, the ability to detect fungi and RNA viruses varied widely across sites whilst that for gram-positive and gram-negative bacteria was relatively consistent (Fig. 2E).
These results emphasize the importance of using a reference panel specifically designed for shotgun pathogen metagenomics to cover all microbial types, as many reference reagents for microbiome studies only include bacteria [26].
We then assessed the technical turnaround time (TAT) of the various workflows across all sites. TATs ranged from 15.4 to 40.0 hours, and fourteen sites had a TAT between 20 and 24 hours (Fig. 2F, table S5). The sequencing reaction took up the largest portion of the workflow, followed by library construction, nucleic acid extraction, and data analysis, constituting 66.6%, 14.3%, 5.9%, and 5.4% of the accumulated TATs, respectively (Fig. 2G, table S5).

Assay Sensitivity Depends on Microbe:Host Abundance Ratio
In the scenario of clinical specimens, pathogens almost always exist amid a variable abundance of host cells. Conventional molecular diagnostics, such as PCR-based assays, often work by detecting specific pathogens with limited interference from human or other microorganisms. Unlike these targeted assays, shotgun metagenomic assays involve unbiased analysis of all nucleic acid molecules within a sample.
Thus, we proposed that not only the absolute pathogen abundance, but also the relative microbe:host abundance ratio may affect assay performance and should be built into the design of the reference reagents.
As all samples in our reference panel included the same titer of human cells (2x10^5/mL), PRH represented a 5-fold higher abundance than PRL in both absolute abundance and relative microbe:host abundance. On the other hand, a 1:10 dilution of any sample represented a 10-fold decrease in absolute abundance, with the relative microbe:host abundance remaining unchanged compared to its undiluted counterpart (Fig. 1B).
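The difference between the two manipulations can be made concrete with a small calculation. In the sketch below, only the host titer of 2x10^5/mL and the 5-fold and 10-fold factors come from the panel design; the microbial titers are illustrative values, not entries from table S1, and the helper name is hypothetical.

```python
from fractions import Fraction

HOST_TITER = 200_000  # human cells/mL in every PR sample (2x10^5/mL)


def microbe_host_ratio(microbe_titer: int, host_titer: int) -> Fraction:
    """Relative microbe:host abundance, kept exact with Fraction."""
    return Fraction(microbe_titer, host_titer)


prl_titer = 1_000            # hypothetical low-titer value (CFU/mL)
prh_titer = 5 * prl_titer    # the PRH counterpart is 5-fold higher

ratio_prl = microbe_host_ratio(prl_titer, HOST_TITER)
ratio_prh = microbe_host_ratio(prh_titer, HOST_TITER)

# A 1:10 dilution lowers the microbe and host titers together,
# so the absolute abundance drops 10-fold while the ratio is unchanged.
ratio_prh_diluted = microbe_host_ratio(prh_titer // 10, HOST_TITER // 10)
```

Here `ratio_prh / ratio_prl` equals 5, whereas `ratio_prh_diluted` equals `ratio_prh`, which is the contrast the panel was designed to probe.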
We compared the observed abundances (as indicated by the number of mapped reads) between PRHs and their PRL counterparts, as well as between the undiluted samples and their diluted counterparts. We found a 5-fold difference in median observed abundance between PRH and PRL, whereas 10-fold sample dilution did not lower the observed abundances (Fig. 3A). Consistent observations were made when bacteria, viruses, and fungi were analyzed separately (Fig. 3B). In agreement with these findings, the lower relative abundance in PRL resulted in lower Recall performance (Fig. 3C), while solely reducing the absolute abundance through sample dilution did not significantly affect the performance (fig. S3). These data showed that the relative microbe:host abundance ratio, but not the absolute microbial abundance, is a key determinant of assay sensitivity in pathogen shotgun metagenomics. Therefore, the limit of detection (LoD) of this assay should be assessed and defined with the relative abundance ratio, rather than the absolute microbial abundance as used for most conventional assays such as PCR-based diagnostics.

Intra- and Cross-site Comparison of Microbial Abundance
We set out to assess the potential of pathogen metagenomics in inferring the expected abundance from the number of reads. We defined the expected pathogen abundance in a sample as (pathogen genome size × pathogen titer) / (human genome size × human cell titer) × the total number of clean reads, and the observed abundance as the actual number of reads. We reasoned that for metagenomics to allow relative pathogen quantification, there should be a linear correlation between the observed and expected abundances. Linear regression analysis showed significant correlations between the expected and observed abundances, either when all the pathogens were analyzed as a whole or separately according to the types of microbes (P<0.001, Fig. 3D). A similar correlation was observed when the abundance of HPV contained in HeLa cells was used as an internal control for normalization (fig. S4). It was not unexpected that the observed abundance was generally lower than the theoretical expectation (fig. S5), which might reflect the loss of microbial nucleic acids during experimental processes such as cell wall breaking. The significant correlation between the observed and expected abundances, along with the recovery of the microbe:host ratio, suggested the assay's ability for intra-site relative abundance measurement.
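The expected-abundance definition above can be written out directly. The sketch below is a minimal transcription of that formula; the function name is hypothetical, and the example values (a 3 Mb bacterial genome at 10^5 CFU/mL, a 3 Gb human genome, 20 million clean reads) are illustrative assumptions rather than panel data.

```python
def expected_reads(pathogen_genome_bp: int, pathogen_titer: float,
                   human_genome_bp: int, human_titer: float,
                   total_clean_reads: int) -> float:
    """Expected pathogen reads:
    (genome size x titer of pathogen) / (genome size x titer of human cells)
    x total number of clean reads."""
    pathogen_dna = pathogen_genome_bp * pathogen_titer
    human_dna = human_genome_bp * human_titer
    return pathogen_dna / human_dna * total_clean_reads


# Illustrative values only (assumptions, not entries from table S1/S2).
example = expected_reads(
    pathogen_genome_bp=3_000_000,     # 3 Mb bacterial genome
    pathogen_titer=100_000,           # 10^5 CFU/mL
    human_genome_bp=3_000_000_000,    # 3 Gb human genome (assumed)
    human_titer=200_000,              # 2x10^5 human cells/mL
    total_clean_reads=20_000_000,     # 20 million clean reads
)
```

Under these assumptions the expectation is about 10,000 reads, i.e. 0.05% of the library, which shows how strongly the host background dilutes the microbial signal.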
As microbial abundance was inferred by the fraction of mapped reads, we wondered if the numbers of mapped reads could be of indicative value across sites. Numbers of mapped reads per million (RPM) varied significantly across sites, with differences of up to two orders of magnitude. Such a difference in RPM was not just a result of applying different techniques, as substantial variation was still observed when sites using similar technical workflows were grouped and compared (Fig. 3E). By analyzing each key technical component in the experimental procedures, our data revealed that host depletion and column-based extraction methods were associated with higher RPMs than other technical variables, whereas library preparation by ultrasound, endonuclease, or transposase did not show significant effects on RPM (Fig. 3F). Adoption of a bead-beating step was associated with a lower RPM, in agreement with its negative correlation with F-score (Fig. 3G).
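For reference, RPM normalizes the mapped read count by library size. The sketch below shows the calculation and a cross-site fold difference; the two sites and their read counts are hypothetical numbers chosen only to illustrate a two-orders-of-magnitude gap.

```python
def reads_per_million(mapped_reads: int, total_reads: int) -> float:
    """Mapped reads normalized to a library of one million reads."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return mapped_reads / total_reads * 1_000_000


# Hypothetical results for the same microbe in the same sample at two sites.
rpm_site_a = reads_per_million(mapped_reads=150, total_reads=20_000_000)
rpm_site_b = reads_per_million(mapped_reads=12_000, total_reads=16_000_000)
fold_difference = rpm_site_b / rpm_site_a
```

With these invented counts the RPM values are 7.5 and 750, a 100-fold difference that normalization by library size alone does not remove.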
These results suggest that pathogen abundance can be inferred by RPM within each site, but without a way to normalize the "site effect", cross-site comparisons provide limited information in cross-center evaluation.

Library Effect Impacts Assay Variation
To understand the assay's reproducibility, we took advantage of the large replicated datasets to measure the coefficient of variation (CV) of mapped reads at each site. The average CV was 0.65, with a range of 0.12 to 1.10, and 75% of the sites had CVs below 0.5. This variation remained at a comparable level among sites that applied similar technical workflows (Fig. 4A). A host depletion step appeared to be associated with a lower CV, which might be due to its higher RPM. While no differences were observed among cell wall breaking steps or the various nucleic acid extraction methods, we found that endonuclease- and transposase-based library preparation demonstrated the lowest and highest CVs of 0.4 and 1.0, respectively (Fig. 4B), implying that this was an important step that introduced variance. When different types of microorganisms were analyzed individually, we found significantly higher CVs for fungal detection than for bacterial or viral detection (0.80, 0.51, and 0.54, respectively) (P<0.001, Fig. 4C).
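The CV used here is the standard deviation of mapped reads across replicates divided by their mean. A minimal sketch follows; it assumes the sample (n-1) standard deviation, which the text does not specify, and the replicate counts are hypothetical.

```python
import statistics


def coefficient_of_variation(read_counts: list[int]) -> float:
    """CV of mapped reads across replicate libraries (sample std / mean)."""
    mean = statistics.mean(read_counts)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    # Assumption: sample standard deviation (n - 1 in the denominator).
    return statistics.stdev(read_counts) / mean


# Hypothetical mapped-read counts for one microbe in 10 replicate libraries.
replicate_counts = [120, 95, 143, 88, 110, 131, 76, 102, 125, 110]
cv = coefficient_of_variation(replicate_counts)
```

Because the CV is dimensionless, it allows the spread of low- and high-abundance microbes to be compared on the same scale.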
To examine how much these fluctuations stemmed from read depth-dependent sampling noise, we performed random re-sampling from the pooled reads to represent such a variance and calculated the CVs of these simulated and experimental datasets. As shown in Fig. 4D, the overall CVs were significantly greater than the simulated CVs regardless of pathogen types. This difference in CV was consistent when each laboratory site or microbial type was assessed individually (Fig. 4C, E), suggesting that besides read depth-dependent sampling, other experimental variables also contribute considerably to the observed fluctuations in metagenomic results.
We then attempted to determine how much the read depth-dependent variance and other experimental variables each contributed to the total variance. We identified a significant linear correlation with an adjusted R^2 of 0.48 and a slope of 0.8 (P<0.001, Fig. 4F), indicating that both read depth-dependent and experimental variances contributed significantly to the overall fluctuation. Among the potential experimental variances, a linear mixed model identified fungal pathogens and transposase-based library construction as significant contributors (table S6), which was consistent with our previous interpretations.
Our data suggest that these variations should be taken into consideration when designing studies to evaluate such an assay. They also imply that precise quantitative measurement of pathogen abundance by shotgun metagenomics remains challenging until these variations are better understood and more sophisticated quantitative modeling is established.

Read-depth Dependency of Assay Recall
Next, we sought to understand how workflow-dependent technical variables may lead to varied site performance. Among the experimental steps, we found that sample pre-treatment had a greater impact on assay performance than nucleic acid extraction, library preparation, or use of an internal control. Preprocessing the samples with host cell depletion was significantly associated with improved F-scores (P<0.001). Unexpectedly, a bead-beating step designed for breaking cell walls did not always result in greater performance but was associated with overall reduced F-scores (P<0.001). Different technical methods for nucleic acid purification and library preparation, different sequencing platforms, and the use of spike-in internal controls were not correlated with overall assay performance (Fig. 3G, fig. S6). Using the Q30 score as a quality indicator, we also found that F-score and Precision (but not Recall) were positively correlated with sequencing data quality (P<0.05, Fig. 5A).
We further explored the impact of read depth on the assay performance. Although the diagnostic performance initially improved as the read depth increased, further increases in data size beyond 10 million reads did not consistently result in higher scores (Fig. 5B). This observation supports the interpretation that the contribution of read depth to assay performance plateaus after a certain read depth. Leveraging our data, which constitute the deepest sequencing of any sample set yet reported, we next set out to determine the optimal read depth by assessing how well the pathogens in our panel could be detected as a function of read depth. To allow raw data analyses, a CLARK-based pipeline was chosen for subsequent site-independent bioinformatics analyses as it demonstrated good performance in both simulated and experimental sequencing datasets [27][28][29] (fig. S7, and more details in Methods).
As shown in Fig. 5C, some pathogens could be detected with only 0.5 million total reads. For instance, site C12 achieved a full Recall of 1.0 at a read depth of 0.5 million in 6 of the 8 PR samples. Nonetheless, when considering the data from all sites, a read depth of 20 million reads enabled detection of most of the microorganisms in our panel, and above that point the benefits of deeper sequencing decreased significantly (Fig. 5C).
We performed sub-analyses by different microbial types of bacteria, fungi, and viruses. The performance of fungal detection plateaued at 5 million reads, while the performance of bacterial and viral detection plateaued at 10 and 20 million reads, respectively (Fig. 5D). These observations were also in line with the interpretation that the sensitivity of pathogen metagenomics decreases as the size of the microbial genome decreases (virus<bacterium<fungus), as smaller genomes result in fewer numbers of nucleic acid fragments that can be sequenced.
These findings suggest read depth as a critical variable that impacts assay Recall when both developing and assessing metagenomic tests. Our results also indicate that although a metagenomic assay requires as few as 0.5 million reads per sample for pathogen detection under optimal conditions, in general, a read depth of 20 million was appropriate under most assay settings.

Assay Precision Is Challenged by Background Microbes
To better understand the causes of the FP results, which we found earlier to substantially impact assay performance, we categorized them into four groups: cross-contamination, background microorganisms, species misclassification, and viral typing error (Fig. 6A). Among these, background microbes and misclassification of species were the leading causes of FP results (49% and 39%, respectively). These two causes also varied the most among sites (fig. S8).
We then sought to evaluate how much taxonomical misclassification could be attributed to the alignment algorithms. To ensure comprehensive microbial coverage, we included 100,000 reads each derived from a total of 108 species comprising 62 bacteria, 42 viruses, and 4 fungi in our simulated dataset and compared the alignment methods employed by the sites in this study (BWA, Bowtie2, and SNAP) [30][31][32] by measuring the percentages of simulated reads that were correctly or incorrectly classified. We found no significant differences at either the genus or species level, or by microbial type (fig. S9), implying that the alignment method was not a critical performance-differentiating factor.
To gain more insights into FP results caused by background microorganisms, we compared the background patterns from all sites (Fig. 6B). Nine prevalent microorganisms were present in >5 sites, while others were more site-specific (Fig. 6C). Background microbial patterns clustered partially depending on the methods of library construction and nucleic acid extraction used (Fig. 6B). These findings implied that microbial backgrounds can be derived from both common and workflow-specific sources and that addressing such issues to improve assay Precision would require site-dependent approaches.
Besides read counts, we also examined whether genome coverage and regional sequencing depth could be informative in discriminating TP from FP. We defined genome coverage as the fraction of the genome covered by metagenomic sequencing, and the regional sequencing depth as the total sequencing length divided by the covered genomic fraction. TP results were associated with significantly lower regional sequencing depth and higher genome coverage (fig. S10), which was consistent with the fact that these microorganisms exist in the samples as full and uniform genomes. Similar observations were also made for FN results that were missed originally but discovered by our site-independent bioinformatics pipeline. All FP detections showed significantly lower levels of genome coverage. A significant increase in the level of regional depth was also found in background microbes, implying that they were present as genomic fragments instead of full microbial bodies or genomes in the samples.
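The two measures defined above can be computed from alignment intervals. The sketch below is one possible reading of those definitions: it assumes half-open (start, end) intervals on a single reference sequence and interprets the "covered genomic fraction" in the regional depth as the covered length in bases. The function name and the example intervals are hypothetical.

```python
def coverage_and_regional_depth(alignments: list[tuple[int, int]],
                                genome_length: int) -> tuple[float, float]:
    """Return (genome coverage, regional sequencing depth).

    alignments: half-open (start, end) intervals of mapped reads.
    """
    total_bases = sum(end - start for start, end in alignments)
    covered = 0
    current_start = current_end = None
    # Merge overlapping or adjacent intervals to count covered bases once.
    for start, end in sorted(alignments):
        if current_end is None or start > current_end:
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    genome_coverage = covered / genome_length
    # Assumption: regional depth = sequenced bases / covered bases.
    regional_depth = total_bases / covered if covered else 0.0
    return genome_coverage, regional_depth


# Reads spread evenly over a 1,000 bp genome (pattern described for TP).
uniform = coverage_and_regional_depth(
    [(0, 100), (200, 300), (400, 500), (600, 700), (800, 900)], 1000)
# The same number of reads piled on one fragment (pattern described for
# background microbes).
piled = coverage_and_regional_depth([(0, 100)] * 5, 1000)
```

With the same five reads, the evenly spread case gives a coverage of 0.5 at a regional depth of 1.0, while the piled-up case gives a coverage of 0.1 at a regional depth of 5.0.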
Taking all these factors into consideration, we built a Precision filter that identified potential FP results through machine learning and applied it to the data derived from site C14, the site that had the highest level of FP results. Integrating RPM, RPM ratio (sample:control), and genome coverage, our method reduced FP results from 34 to 11 counts (Fig. 6D). Importantly, applying such a filter did not compromise Recall, suggesting a potential strategy for improving the precision of pathogen metagenomics.

Discussion
In this coordinated study with large-scale community efforts, nine reference samples that mimicked the context of clinical specimens of infectious diseases were profiled at 17 independent sites with various workflows. The data presented here provide one of the deepest assessments of pathogen metagenomics assays to date.
TAT is highly critical in these scenarios [39]. In this multicenter study, we found that most of the pathogen metagenomic assays could be completed within 24 hours, with sequencing as the most time-consuming step. Given that longer read lengths generally require more reaction time on most sequencing platforms, this indicates that modifying the assay to use shorter reads could be an effective and relatively straightforward strategy for achieving faster assay turnaround.
Lowering the cost by applying an appropriate read depth may allow the wider application of pathogen metagenomics [40]. Our analyses found that across most of the sites, 20 million reads were sufficient for pathogen identification. We also observed that under optimal assay conditions, pathogen detection could be achieved with as few as 0.5 million reads, indicating the potential for further cost reduction as the technology advances [41].
By including various types of pathogens as well as human cells in the design, our reference panel represents the common context of clinical specimens, such as cerebrospinal fluid, sputum, and bronchoalveolar lavage fluid, where infiltration of immune cells is often found under infection. Although sharing common characteristics, our reference samples may not precisely represent plasma specimens, where human cell-free nucleic acids are believed to be more prevalent [3]. For instance, it remains to be determined whether assay variation would be influenced by the cell wall-breaking step, and how each library preparation method fits in the context of cell-free nucleic acids.
We demonstrated the abundance of human cells as a critical factor by showing that pathogen detection by metagenomics was directly affected by the relative abundance of pathogens to host cells. Our data supported a mathematical model in which a sample comprising 10^5/mL each of human cells and bacterial cells would only yield 0.1% of total reads mapped to the bacteria, assuming a human genome of 3 Gb and a bacterial genome of 3 Mb [16,25]. When interpreting results from pathogen metagenomics, it is important to note that its sensitivity could be affected by the host nucleic acids, and could therefore vary across samples [16]. Host nucleic acid depletion would therefore be highly valuable in improving the read yield by increasing the relative microbial abundance, and eventually in lowering read depth and assay costs. A variety of approaches have been reported for host cell depletion [42][43][44][45]. However, cautious validation should be performed before applying these methods to ensure that pathogens are not "co-depleted" with the host cells in a biased manner during the process. Indeed, in a previous study, differential lysis could significantly reduce human cells but at the same time compromise detection of viral and certain bacterial pathogens [43]. Unlike targeted assays, pathogen metagenomics is also unique in its potential for unbiased detection of novel pathogenic microbes, as was shown in the discovery of COVID-19. Such ability heavily depends on bioinformatics analysis to discriminate between novel and previously identified pathogens, as well as closely related ones, for instance, between SARS-CoV-2 and SARS-CoV.
Evaluating such an unusual aspect of assay performance would require new designs of the reference reagents that represent potential novel species.
Data in this study included sequencing results generated from different platforms and workflows using the same set of reference samples. This information is a unique resource that could be valuable for the development and optimization of bioinformatics pipelines for rapid pathogen detection. Current bioinformatics pipelines mostly rely on the number of mapped reads for pathogen identification [46][47][48]. With our dataset, more sophisticated identification algorithms could be explored by integrating more variables, such as genome coverage and phylogenetic relationships, to improve specificity. These data also provide a general overview of the current performance of pathogen metagenomics, which could aid in establishing regulatory or technical references.

Conclusion
We have reported the design and development of a set of reference reagents and reporting metrics dedicated to shotgun metagenomics for pathogen detection, which can help standardise the field of clinical metagenomics. By testing these reference reagents in a multicenter study that included 17 independent laboratories, we demonstrated that the performance of metagenomic assays was significantly impacted by the microbial type, the host context and read depth, and emphasized the importance of considering these factors when designing reference reagents and benchmarking studies. Moreover, across sites, workflows and platforms, we found false positive reporting and considerable site/library effects to be common challenges to the assay's accuracy and quantifiability. Our study also suggested practical guidance on performance for laboratory-developed shotgun pathogen metagenomics assays.

Multiplex PCR System (bioMérieux, Craponne, France), and quantitated by standard plate counts. Viral organisms were validated by Sanger sequencing and quantitated by droplet digital PCR (ddPCR). These microbes were then contrived into PBS solutions with 2x10^5/mL of HeLa cells (ATCC) at indicated concentrations (table S1). In the PRH group, these microorganisms were contrived at 200-350,000 CFU/mL for bacterial and 400-10,000 CFU/mL for fungal pathogens, at 660-2,000,000 copies/mL for DNA viruses and at 140-3,500,000 copies/mL for RNA viruses to represent common ranges of clinical infection.

Comparison of bioinformatics pipelines
We used Mason (Mason - A Read Simulator for Next Generation Sequencing Data, v0.1.2) to generate simulated sequencing data for 108 microbial genomes, including 62 bacterial, 42 viral, and 4 fungal microorganisms. A total of 100,000 single-end, 75 bp reads were generated for each microbe and subjected to taxonomic identification by the Centrifuge [27], Kraken [29], and CLARK [28] pipelines separately. Pipeline performance was assessed at both the genus and species levels and also by microbial class. Sensitivity was inferred by the number of reads mapped specifically to the correct taxa, and specificity was inferred by the percentage of reads mapped specifically to the incorrect taxa.
Statistical comparisons were done by Wilcoxon rank tests.
A similar strategy was used for the comparison among the alignment algorithms BWA [30], Bowtie2 [31], and SNAP [32].
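The per-microbe read tallies described above can be sketched as follows. This is an illustration under stated assumptions, not the evaluation code used in the study: the function name is hypothetical, each simulated read is assumed to carry one assigned taxon (or `None` when unclassified), and both measures are expressed as percentages of all simulated reads.

```python
from typing import Optional


def classification_rates(true_taxon: str,
                         assigned: list[Optional[str]]) -> tuple[float, float]:
    """Percentages of simulated reads assigned to the correct and to an
    incorrect taxon; unclassified reads (None) count toward neither."""
    total = len(assigned)
    correct = sum(1 for taxon in assigned if taxon == true_taxon)
    incorrect = sum(1 for taxon in assigned
                    if taxon is not None and taxon != true_taxon)
    return 100 * correct / total, 100 * incorrect / total


# Hypothetical assignments for 100 reads simulated from one genome.
assigned = (["Escherichia coli"] * 90
            + ["Shigella flexneri"] * 6
            + [None] * 4)
pct_correct, pct_incorrect = classification_rates("Escherichia coli", assigned)
```

Keeping unclassified reads separate from misassigned ones matters here, because only the latter can produce false positive species calls.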

Correlation analysis of observed and theoretical abundances
Raw sequencing data from 17 sites were analyzed using the site-independent CLARK-based pipeline to obtain the observed abundance of each microorganism in each sample, except for RNA viruses. The theoretical abundance of a microorganism in a sample was proportional to the ratio of DNA of that microorganism and the size of sequencing data, calculated as below:
theoretical abundance = (pathogen genome size × pathogen titer) / (human genome size × human cell titer) × total number of clean reads

Analysis of Simulated and Observed Coefficients of Variation (CVs)
Fastq data from the 10 replicates of each sample were merged, re-split randomly according to the original read depth (number of reads), and analyzed by the CLARK-based pipeline. The CVs were calculated based on the read numbers mapped to each microbe. This process was repeated 10 times in order to obtain a total of 10 simulated CVs for each microbe. We used the average of the 10 simulated CVs as the expected CV resulting from read depth variation. A linear regression model was used to evaluate the contribution of the data size CVs to the observed CVs.
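The merge, re-split and averaging procedure can be sketched on read labels instead of Fastq files. This is a simplified stand-in under stated assumptions: each read is reduced to a flag saying whether it maps to the microbe of interest, the classification step is skipped, and the sample (n-1) standard deviation is assumed. The function name, depths and counts are hypothetical.

```python
import random
import statistics


def simulated_cv(depths: list[int], mapped_counts: list[int],
                 n_rounds: int = 10, seed: int = 0) -> float:
    """Average CV expected from read-depth sampling alone."""
    rng = random.Random(seed)
    # Merge all replicates: 1 = read maps to the microbe, 0 = any other read.
    pool = []
    for depth, mapped in zip(depths, mapped_counts):
        pool.extend([1] * mapped + [0] * (depth - mapped))
    cvs = []
    for _ in range(n_rounds):
        rng.shuffle(pool)
        counts, start = [], 0
        # Re-split the pool into libraries with the original read depths.
        for depth in depths:
            counts.append(sum(pool[start:start + depth]))
            start += depth
        cvs.append(statistics.stdev(counts) / statistics.mean(counts))
    return statistics.mean(cvs)


# Hypothetical replicate set: 10 libraries of 1,000 reads each.
depths = [1000] * 10
mapped = [5, 40, 12, 30, 8, 25, 3, 50, 17, 10]
observed = statistics.stdev(mapped) / statistics.mean(mapped)
expected_from_depth = simulated_cv(depths, mapped)
```

In this invented example the observed CV is well above the simulated one, which is the pattern reported for the experimental data; the gap is what would be attributed to experimental variables other than read depth.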
In addition, we used a linear mixed model to further evaluate whether the sequencing platform, library method, and class of microorganism affected the observed CV. The formula of the linear mixed model was defined as: Cv_observed ~ Cv_datasize + Library Prep + Microbial Class + Platform + (1+Cv_observed|Center) where Center was a random effect, and the read depth CV (Cv_datasize), library preparation method (Library Prep), microbial class, and sequencing platform (Platform) were fixed effects.

Analysis of Read-depth Requirement for Pathogen Detection
Fastq data from the 10 replicates of each sample were merged, and resampled to the desired read depth of 0.5M, 1M, 5M, 10M, 20M, 30M, and 50M total reads for each sample. The re-sampled data were analyzed by the CLARK-based pipeline and identification of a microorganism was defined by over 4 species-specific mapped reads in a sample. The recall performance was assessed at each indicated read depth for each site by sample or by microbial class.
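A toy version of this down-sampling analysis is sketched below. It assumes that each pooled read has already been given a species label (`None` for host or unclassified reads), so the classification step is not reproduced; only the threshold of more than 4 species-specific reads comes from the text. The function name and the pooled read composition are hypothetical.

```python
import random
from collections import Counter
from typing import Optional

MIN_READS = 4  # a microbe is called when it has more than 4 specific reads


def recall_at_depth(read_labels: list[Optional[str]],
                    expected_species: set[str],
                    depth: int, seed: int = 0) -> float:
    """Recall after random down-sampling of pooled reads to `depth` reads."""
    rng = random.Random(seed)
    subsample = rng.sample(read_labels, depth)  # sampling without replacement
    counts = Counter(label for label in subsample if label is not None)
    detected = {s for s in expected_species if counts[s] > MIN_READS}
    return len(detected) / len(expected_species)


# Hypothetical pooled library of 10,000 labelled reads.
pooled = (["Escherichia coli"] * 1000
          + ["Echovirus 11"] * 5
          + [None] * 8995)
expected = {"Escherichia coli", "Echovirus 11"}

recall_full = recall_at_depth(pooled, expected, depth=len(pooled))
recall_tiny = recall_at_depth(pooled, expected, depth=4)
```

At full depth both microbes pass the threshold, while a subsample of 4 reads cannot contain more than 4 reads of any species, so Recall drops to zero; intermediate depths depend on the random draw.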

Declarations
Ethics approval and consent to participate: Not applicable
Consent for publication: Not applicable

46. Breitwieser FP, Baker DN, Salzberg SL. KrakenUniq: confident and fast metagenomics classification using unique k-mer counts.

Table S1. Design and composition of the pathogen panel.
Table S2. Pathogen panel genome characteristics.
Table S3. Sites and their corresponding technical parameters.
Table S4. Library numbers and data size in the study.