Design of Pathogen Reference Panel, Reporting Metrics and Study Overview
To mimic the biological context of clinical specimens and enable comprehensive assessments, we designed and constructed a panel of 9 pathogen reference (PR) reagents that covered 30 potentially pathogenic microorganisms of 5 different types (gram-negative and gram-positive bacteria, fungi, and DNA and RNA viruses) and included 2x10^5/mL human cells as host background (table S1) [24, 25]. These 30 species spanned 19 genera, with some species intentionally chosen from the same genus to test the ability of the assays to discriminate closely related microbes. The panel was also designed to include microorganisms with a wide range of genome sizes (from 0.7 kb to 19.05 Mb) and GC contents (from 33.2% to 70.4%, Fig. 1A, table S2).
Among this reference panel of 9 PR samples, one served as a control (pathogen reference control or PRC) and contained no contrived microbes. The other 8 samples could be grouped into two sets (PRH1-PRH4 and PRL1-PRL4) or four pairs (e.g., PRH1 and PRL1). Each pair of PR samples comprised the same contrived microorganisms at two different titers: the sample in the PRH group contained each microbe at a 5-fold higher titer than its PRL counterpart (Fig. 1B). For instance, PRH1 and PRL1 both contained the same microorganisms (Escherichia coli K1, Streptococcus pneumoniae, Cryptococcus neoformans, Echovirus 11, Herpes simplex virus 1, Human betaherpesvirus 5, Human herpesvirus 6B), but each microorganism in PRH1 was 5-fold greater in titer than in PRL1. Every reference reagent in the PR panel was verified by polymerase chain reaction (PCR)-based methods and distributed to 17 independent laboratory sites (centers C1-C17) for blinded metagenomic testing and bioinformatics analysis (Fig. 1C, Supplemental Methods). These laboratories employed various experimental procedures, bioinformatics pipelines, and sequencing platforms, encompassing four different sample preprocessing steps, two nucleic acid extraction approaches, three library preparation approaches, six sequencing platforms, and four bioinformatics methods.
To support quantitative assessments, the PR panel was tested both undiluted and at a 1:10 dilution in 10 replicates, except for the pathogen-free PRC sample. A total of 2,641 libraries were sequenced (Fig. 1C, table S3, table S4), generating 587 billion reads and 5.51 TB of data. After blinded testing, each site was requested to report a list of detected microbes along with their mapped reads, as well as the corresponding raw sequencing data, for further independent meta-data analyses. Given the unique, agnostic nature of this assay, we assessed the results using the performance metrics Recall, Precision, and F-score, which indicate assay sensitivity, specificity, and overall accuracy, respectively (Fig. 1C).
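For clarity, these metrics were derived from each site's reported microbe list against the known composition of the panel. The following is a minimal sketch of such a computation; the function, the example sets, and the truncated ground truth are illustrative only and do not reproduce the study's actual scoring scripts.

```python
# Illustrative computation of Recall, Precision, and F-score for one
# reported sample, given the set of microbes truly contrived in the panel.
def evaluate_report(reported: set[str], expected: set[str]) -> dict[str, float]:
    tp = len(reported & expected)            # correctly reported microbes
    fp = len(reported - expected)            # reported but not contrived
    fn = len(expected - reported)            # contrived but not reported
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if (precision + recall) else 0.0)
    return {"TP": tp, "FP": fp, "FN": fn,
            "Recall": recall, "Precision": precision, "F-score": f_score}

# Example: a truncated PRH1-like ground truth vs. a hypothetical site report
expected = {"Escherichia coli", "Streptococcus pneumoniae",
            "Cryptococcus neoformans", "Echovirus 11"}
reported = {"Escherichia coli", "Streptococcus pneumoniae",
            "Staphylococcus aureus"}
print(evaluate_report(reported, expected))
```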
Multicenter Evaluation of Clinical Metagenomics Using the PR Panel
We first evaluated the diagnostic performance across the 17 sites. F-scores varied considerably across the 17 laboratory sites, ranging from 0.5 to 1.0 with an average of 0.81. Although only 2 of the 17 sites achieved an overall F-score of 1.0, 59% (10 sites) achieved an F-score of >0.83 (Fig. 2A, B). At the sample level, nearly 40% of samples reached F-scores over 0.9, nearly 70% were over 0.75, and only 4% were below 0.5 (fig. S1A). Visualization of the similarity in detected microbes demonstrated that results clearly grouped by reference sample, despite variation across sites (Fig. 2C).
Recall and Precision contributed differentially to the site-to-site variation in F-score (Fig. 2A, fig. S1B). While Recall remained relatively consistent (average 0.88, range 0.75-1.0), Precision varied substantially across sites (average 0.77, range 0.45-1.0). Similar observations were made at the sample level (fig. S1A). To further dissect the cross-site variation in diagnostic performance, we analyzed the TP, FP, and FN results at each site and found FP counts to be the most variable, ranging from 0 to 35 per site (fig. S1C), whereas TP and FN counts were relatively consistent, ranging from 21 to 30 and 0 to 9, respectively. These results suggest that overall assay performance across workflows and sites was differentiated more by the ability to reduce false positives than by the ability to reduce false negatives. Intriguingly, although Recall and Precision measure two different aspects of assay performance, we observed a significant positive correlation between them rather than a trade-off (P=0.013, fig. S1D).
Among the different microbial types, RNA viruses appeared to be the most challenging to detect, with an average Recall of only 0.71 across all sites, significantly lower than that of other pathogens. Gram-positive and gram-negative bacteria had the highest Recall among all microbial types (0.96 and 0.94), followed by DNA viruses and fungi at 0.89 and 0.80, respectively (Fig. 2D). Similar Recall patterns were observed between the PRH and PRL panels (fig. S2A). Most microorganisms at titers above 200 CFU/mL or copies/mL could be detected by >50% of the sites, although some RNA viruses were missed even at titers above 100,000 copies/mL (Fig. 2E). Among all the microorganisms in our panel, fungi and RNA viruses, including Echovirus 11, Human respiratory syncytial virus B, Human parechovirus, Candida albicans, and Candida lusitaniae, were the most prevalent causes of false-negative results (fig. S2B). In line with these findings, the ability to detect fungi and RNA viruses varied widely across sites, whereas that for gram-positive and gram-negative bacteria was relatively consistent (Fig. 2E). These findings emphasize the importance of using a reference panel specifically designed for shotgun pathogen metagenomics to cover all microbial types, as many reference reagents for microbiome studies include only bacteria [26].
We also assessed the technical turnaround time (TAT) of the workflows across all sites. TATs ranged from 15.4 to 40.0 hours, with fourteen sites falling between 20 and 24 hours (Fig. 2F, table S5). The sequencing reaction took up the largest portion of the workflow, followed by library construction, nucleic acid extraction, and data analysis, accounting for 66.6%, 14.3%, 5.9%, and 5.4% of the accumulated TATs, respectively (Fig. 2G, table S5).
Assay Sensitivity Depends on Microbe:Host Abundance Ratio
In clinical specimens, pathogens almost always exist amid a variable abundance of host cells. Conventional molecular diagnostics, such as PCR-based assays, detect specific pathogens with limited interference from human cells or other microorganisms. Unlike these targeted assays, shotgun metagenomic assays involve unbiased analysis of all nucleic acid molecules within a sample. We therefore proposed that not only the absolute pathogen abundance but also the relative microbe:host abundance ratio may affect assay performance and should be built into the design of the reference reagents.
In our reference panel, all samples contained the same titer of human cells (2x10^5/mL); each PRH sample therefore represented a 5-fold higher abundance than its PRL counterpart in both absolute abundance and relative microbe:host abundance. On the other hand, a 1:10 dilution of any sample represented a 10-fold decrease in absolute abundance, with the relative microbe:host abundance remaining unchanged compared to its undiluted counterpart (Fig. 1B).
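A simple numerical sketch illustrates this design logic: because sequencing yields a roughly fixed number of reads per library, diluting microbes and host together leaves the expected microbial read fraction unchanged, whereas lowering the microbial titer against a constant host background reduces it proportionally. All titers and genome sizes below, other than the 2x10^5/mL host cell concentration, are hypothetical.

```python
# Illustrative: the fraction of sequencing reads expected from a microbe
# depends on its abundance relative to the host background, not on the
# absolute titer alone.
HOST_CELLS_PER_ML = 2e5

def microbe_read_fraction(microbe_titer, host_titer,
                          microbe_genome=5e6, host_genome=3e9):
    """Approximate fraction of reads originating from the microbe,
    assuming reads are sampled in proportion to total nucleic acid."""
    microbe_na = microbe_titer * microbe_genome
    host_na = host_titer * host_genome
    return microbe_na / (microbe_na + host_na)

prh = microbe_read_fraction(5e3, HOST_CELLS_PER_ML)          # PRH-like titer
prl = microbe_read_fraction(1e3, HOST_CELLS_PER_ML)          # 5-fold lower (PRL-like)
diluted = microbe_read_fraction(5e2, HOST_CELLS_PER_ML / 10) # PRH at 1:10 dilution

print(f"PRH-like: {prh:.2e}, PRL-like: {prl:.2e}, PRH diluted 1:10: {diluted:.2e}")
# The 1:10 dilution leaves the read fraction unchanged relative to PRH,
# while the 5-fold lower titer in PRL lowers it roughly 5-fold.
```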
We compared the observed abundances (as indicated by the number of mapped reads) between the PRH samples and their PRL counterparts, as well as between the undiluted samples and their diluted counterparts. We found a 5-fold difference in median observed abundance between PRH and PRL, whereas 10-fold sample dilution did not lower the observed abundances (Fig. 3A). Consistent observations were made when bacteria, viruses, and fungi were analyzed separately (Fig. 3B). In agreement with these findings, the lower relative abundance in PRL resulted in lower Recall (Fig. 3C), while reducing the absolute abundance alone through sample dilution did not significantly affect performance (fig. S3). These data show that the relative microbe:host abundance ratio, rather than the absolute microbial abundance, is a key determinant of the sensitivity of pathogen shotgun metagenomics. Therefore, the limit of detection (LoD) of this assay should be assessed and defined in terms of the relative abundance ratio, rather than the absolute microbial abundance used for most conventional assays such as PCR-based diagnostics.
Intra- and Cross-site Comparison of Microbial Abundance
We set out to assess the potential of pathogen metagenomics to infer the expected abundance from the number of reads. We defined the expected pathogen abundance in a sample as (pathogen genome size x pathogen titer) / (human genome size x human cell titer) x the total number of clean reads, and the observed abundance as the actual number of mapped reads. We reasoned that for metagenomics to allow relative pathogen quantification, there should be a linear correlation between the observed and expected abundances. Linear regression analysis showed significant correlations between the expected and observed abundances, both when all pathogens were analyzed together and when they were analyzed separately by microbial type (P<0.001, Fig. 3D). A similar correlation was observed when the abundance of HPV contained in HeLa cells was used as an internal control for normalization (fig. S4). Not unexpectedly, the observed abundance was generally lower than the theoretical expectation (fig. S5), which might reflect the loss of microbial nucleic acids during experimental processes such as cell wall disruption. The significant correlation between the observed and expected abundances, along with the recovery of the microbe:host ratio, supports the assay's ability for intra-site relative abundance measurement.
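As a sketch of this analysis, the expected abundance can be computed directly from the definition above and regressed against observed mapped-read counts; the genome sizes, titers, and read counts in the example are hypothetical placeholders, not values from the study.

```python
import numpy as np
from scipy import stats

HUMAN_GENOME = 3.1e9   # bp, approximate
HOST_TITER = 2e5       # human cells per mL in every PR sample

def expected_abundance(genome_size, titer, total_clean_reads):
    """Expected read count per the definition above:
    (pathogen genome size x titer) / (human genome size x host titer)
    x total clean reads."""
    return (genome_size * titer) / (HUMAN_GENOME * HOST_TITER) * total_clean_reads

# Hypothetical expected/observed pairs for a handful of detections
expected = np.array([expected_abundance(g, t, 2e7)
                     for g, t in [(5e6, 1e4), (2e6, 5e3), (1.9e7, 1e3), (1.5e5, 1e5)]])
observed = np.array([180.0, 35.0, 60.0, 90.0])   # hypothetical mapped read counts

# Linear regression in log space, since abundances span orders of magnitude
slope, intercept, r, p, se = stats.linregress(np.log10(expected), np.log10(observed))
print(f"slope={slope:.2f}, R^2={r**2:.2f}, P={p:.3g}")
```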
As microbial abundance was inferred from the fraction of mapped reads, we wondered whether the numbers of mapped reads could be of indicative value across sites. Numbers of mapped reads per million (RPM) varied significantly across sites, with differences of up to two orders of magnitude. This difference in RPM was not just a result of applying different techniques, as substantial variation remained when sites using similar technical workflows were grouped and compared (Fig. 3E). By analyzing each key technical component of the experimental procedures, our data revealed that host depletion and column-based extraction methods were associated with higher RPMs than other technical variables, whereas library preparation by ultrasound, endonuclease, or transposase fragmentation did not show significant effects on RPM (Fig. 3F). Inclusion of a bead-beating step was associated with a lower RPM, in agreement with its negative correlation with F-score (Fig. 3G).
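RPM here refers to the standard normalization of mapped reads to library size; a minimal sketch with hypothetical counts and column names follows.

```python
import pandas as pd

# Hypothetical table of mapped read counts per microbe per library
df = pd.DataFrame({
    "site": ["C1", "C1", "C2", "C2"],
    "microbe": ["E. coli", "C. neoformans", "E. coli", "C. neoformans"],
    "mapped_reads": [1200, 150, 300, 40],
    "total_clean_reads": [2.0e7, 2.0e7, 5.0e6, 5.0e6],
})

# Reads per million (RPM): mapped reads normalized to library size
df["RPM"] = df["mapped_reads"] / df["total_clean_reads"] * 1e6
print(df[["site", "microbe", "RPM"]])
```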
These results suggest that pathogen abundance can be inferred from RPM within each site, but without a way to normalize this "site effect", cross-site comparisons of RPM provide limited information.
Library Effect Impacts Assay Variation
To understand the assay's reproducibility, we took advantage of the large replicated datasets to measure the coefficient of variation (CV) of mapped reads at each site. The average CV was 0.65, ranging from 0.12 to 1.10, and 75% of the sites had CVs below 0.5. This variation remained at a comparable level among sites that applied similar technical workflows (Fig. 4A). A host depletion step appeared to be associated with a lower CV, possibly owing to its higher RPM. While no differences were observed for other processes, including cell wall disruption and the various nucleic acid extraction methods, we found that endonuclease- and transposase-based library preparation showed the lowest and highest CVs of 0.4 and 1.0, respectively (Fig. 4B), implying that library preparation is an important step that introduces variance. When different types of microorganisms were analyzed individually, we found significantly higher CVs for fungal detection than for bacterial or viral detection (0.80, 0.51, and 0.54, respectively) (P<0.001, Fig. 4C).
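The CV was computed per microbe across replicate libraries as the standard deviation divided by the mean of mapped reads; a minimal sketch with hypothetical replicate counts is shown below.

```python
import numpy as np

def coefficient_of_variation(read_counts):
    """CV = standard deviation / mean of mapped reads across replicate libraries."""
    counts = np.asarray(read_counts, dtype=float)
    return counts.std(ddof=1) / counts.mean()

# Hypothetical mapped-read counts for one microbe across 10 replicate libraries
replicates = [420, 380, 510, 300, 450, 390, 620, 280, 410, 360]
print(f"CV = {coefficient_of_variation(replicates):.2f}")
```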
To examine how much of these fluctuations stemmed from read depth-dependent sampling noise, we performed random re-sampling from the pooled reads to model this source of variance and calculated the CVs of the simulated and experimental datasets. As shown in Fig. 4D, the overall experimental CVs were significantly greater than the simulated CVs regardless of pathogen type. This difference in CV was consistent when each laboratory site or microbial type was assessed individually (Fig. 4C, E), suggesting that, beyond read depth-dependent sampling, other experimental variables also contribute considerably to the observed fluctuations in metagenomic results.
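One way to approximate read depth-dependent sampling noise is to treat each microbe's reads as a binomial draw from the pooled read fraction and compute the CV of the simulated draws; the sketch below makes this binomial assumption and uses hypothetical parameters, and is not the study's exact re-sampling procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulated_cv(pooled_fraction, reads_per_library, n_replicates=10, n_trials=1000):
    """Approximate the CV expected from read-depth-dependent sampling alone,
    by repeatedly drawing the microbe's reads as a binomial sample from the
    pooled read fraction."""
    draws = rng.binomial(reads_per_library, pooled_fraction,
                         size=(n_trials, n_replicates)).astype(float)
    cvs = draws.std(axis=1, ddof=1) / draws.mean(axis=1)
    return cvs.mean()

# Hypothetical: a microbe at 20 reads per million in libraries of 10 million reads
print(f"simulated CV = {simulated_cv(20e-6, 10_000_000):.2f}")
```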
We then attempted to determine how much the read depth-dependent variance and other experimental variables each contributed to the total variance. We identified a significant linear correlation between the simulated and experimental CVs, with an adjusted R² of 0.48 and a slope of 0.8 (P<0.001, Fig. 4F), indicating that both read depth-dependent and experimental variances contributed significantly to the overall fluctuation. Among the potential experimental variables, a linear mixed model identified fungal pathogens and transposase-based library construction as significant contributors (table S6), consistent with our earlier interpretations.
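A linear mixed model of this kind could be specified, for example, with the site as a random effect and microbial type and library construction method as fixed effects; the sketch below uses statsmodels' MixedLM on randomly generated placeholder data and is not the study's actual model specification.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format data: one row per detection with its CV,
# the microbial type, the library construction method, and the site.
rng = np.random.default_rng(1)
n = 120
df = pd.DataFrame({
    "site": rng.choice([f"C{i}" for i in range(1, 18)], n),
    "microbe_type": rng.choice(["bacterium", "virus", "fungus"], n),
    "library_method": rng.choice(["ultrasound", "endonuclease", "transposase"], n),
    "cv": rng.gamma(shape=2.0, scale=0.3, size=n),
})

# Fixed effects for microbial type and library method; random intercept per site
model = smf.mixedlm("cv ~ C(microbe_type) + C(library_method)",
                    data=df, groups=df["site"])
result = model.fit()
print(result.summary())
```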
Our data suggest that these variations should be taken into consideration when designing studies to evaluate such assays. They also imply that precise quantitative measurement of pathogen abundance by shotgun metagenomics will remain challenging until these variations are better understood and more sophisticated quantitative modeling is established.
Read-depth Dependency of Assay Recall
Next, we sought to understand how workflow-dependent technical variables may lead to varied site performance. Among the experimental steps, sample pre-treatment had a greater impact on assay performance than nucleic acid extraction, library preparation, or the use of an internal control. Preprocessing the samples with host cell depletion was significantly associated with improved F-scores (P<0.001). Unexpectedly, a bead-beating step designed to break cell walls did not improve performance but was instead associated with an overall reduction in F-score (P<0.001). The different methods for nucleic acid purification and library preparation, the different sequencing platforms, and the use of spike-in internal controls were not correlated with overall assay performance (Fig. 3G, fig. S6). Using the Q30 score as a quality indicator, we also found that F-score and Precision (but not Recall) were positively correlated with sequencing data quality (P<0.05, Fig. 5A).
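Associations of this kind can be probed with simple nonparametric tests, for example comparing F-scores between sites with and without host depletion and correlating Q30 with F-score; the sketch below uses hypothetical values and does not reproduce the study's statistical code.

```python
from scipy import stats

# Hypothetical per-site F-scores grouped by whether host depletion was used
f_with_depletion = [0.95, 0.91, 0.88, 0.93, 0.97]
f_without_depletion = [0.72, 0.80, 0.65, 0.78, 0.85, 0.70]
u, p = stats.mannwhitneyu(f_with_depletion, f_without_depletion, alternative="two-sided")
print(f"host depletion vs. none: U={u:.1f}, P={p:.3f}")

# Hypothetical per-library Q30 fractions and F-scores
q30 = [0.88, 0.90, 0.93, 0.85, 0.95, 0.91]
f_scores = [0.78, 0.82, 0.90, 0.70, 0.93, 0.85]
rho, p = stats.spearmanr(q30, f_scores)
print(f"Q30 vs. F-score: rho={rho:.2f}, P={p:.3f}")
```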
We further explored the impact of read depth on assay performance. Although diagnostic performance initially improved as read depth increased, increasing the data size beyond 10 million reads did not consistently result in higher scores (Fig. 5B). This observation supports the interpretation that the contribution of read depth to assay performance plateaus beyond a certain depth. Leveraging our data, which constitute the deepest sequencing of any sample set reported to date, we next set out to determine the optimal read depth by assessing how well the pathogens in our panel could be detected as a function of read depth. To allow raw data analyses, a CLARK-based pipeline was chosen for the subsequent site-independent bioinformatics analyses, as it demonstrated good performance in both simulated and experimental sequencing datasets [27-29] (fig. S7; more details in Methods).
As shown in Fig. 5C, some pathogens could be detected with only 0.5 million total reads. For instance, site C12 achieved a full Recall of 1.0 at a read depth of 0.5 million reads in 6 of the 8 PR samples. Nonetheless, when considering the data from all sites, a read depth of 20 million reads enabled detection of most of the microorganisms in our panel, and beyond that point the benefit of deeper sequencing decreased markedly (Fig. 5C).
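Read-depth titration of this kind can be implemented by randomly subsampling each library to a series of depths and re-running detection at each depth; in the sketch below, detect() is a placeholder for the classification pipeline (e.g., a CLARK-based workflow), and reads and expected_microbes are assumed inputs rather than study data.

```python
import random

def subsample_reads(reads, depth, seed=0):
    """Randomly subsample a library to a fixed read depth (without replacement)."""
    random.seed(seed)
    return random.sample(reads, min(depth, len(reads)))

def recall_at_depths(reads, expected_microbes, detect, depths):
    """Recall as a function of read depth; `detect` stands in for the
    classification pipeline and returns the set of microbes called from
    the subsampled reads."""
    results = {}
    for depth in depths:
        called = detect(subsample_reads(reads, depth))
        results[depth] = len(called & expected_microbes) / len(expected_microbes)
    return results

# Usage sketch (reads, detect, and expected_microbes are placeholders):
# recall_at_depths(reads, expected_microbes, detect,
#                  depths=[500_000, 5_000_000, 10_000_000, 20_000_000])
```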
We performed sub-analyses by microbial type (bacteria, fungi, and viruses). The performance of fungal detection plateaued at 5 million reads, while the performance of bacterial and viral detection plateaued at 10 and 20 million reads, respectively (Fig. 5D). These observations are in line with the interpretation that the sensitivity of pathogen metagenomics decreases as the size of the microbial genome decreases (virus<bacterium<fungus), as smaller genomes yield fewer nucleic acid fragments that can be sequenced.
These findings identify read depth as a critical variable that impacts assay Recall when both developing and assessing metagenomic tests. Our results also indicate that although a metagenomic assay may require as few as 0.5 million reads per sample for pathogen detection under optimal conditions, a read depth of 20 million reads is appropriate under most assay settings.
Assay Precision Is Challenged by Background Microbes
To better understand the causes of FP results, which we found earlier to substantially impact assay performance, we categorized them into four groups: cross-contamination, background microorganisms, species misclassification, and viral typing error (Fig. 6A). Among these, background microbes and misclassification of species were the leading causes of FP results (49% and 39%, respectively). These two causes also varied the most among sites (fig. S8).
We then sought to evaluate how much of the taxonomic misclassification could be attributed to the alignment algorithms. To ensure comprehensive microbial coverage, we included in our simulated dataset 100,000 reads each from a total of 108 species, comprising 62 bacteria, 42 viruses, and 4 fungi, and compared the alignment methods employed by the sites in this study (BWA, Bowtie, and SNAP) [30-32] by measuring the percentages of simulated reads that were correctly or incorrectly classified. We found no significant differences at either the genus or species level, or by microbial type (fig. S9), implying that the alignment method was not a critical performance-differentiating factor.
To gain more insight into FP results caused by background microorganisms, we compared the background patterns from all sites (Fig. 6B). Nine prevalent microorganisms were present in more than 5 sites, while others were more site-specific (Fig. 6C). Background microbial patterns clustered partially according to the methods of library construction and nucleic acid extraction used (Fig. 6B). These findings imply that microbial backgrounds can derive from both common and workflow-specific sources and that addressing them to improve assay Precision will require site-dependent approaches.
Besides read counts, we also examined whether genome coverage and regional sequencing depth could be informative in discriminating TP from FP results. We defined genome coverage as the fraction of the genome covered by metagenomic sequencing, and the regional sequencing depth as the total sequenced length divided by the covered genomic fraction. TP results were associated with significantly lower regional sequencing depth and higher genome coverage (fig. S10), consistent with these microorganisms existing in the samples as complete and uniformly represented genomes. Similar observations were made for FN results that were missed originally but recovered by our site-independent bioinformatics pipeline. In contrast, FP detections showed significantly lower genome coverage, and a significant increase in regional depth was found for background microbes, implying that they were present as genomic fragments rather than intact microbial cells or genomes.
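Both metrics can be derived from a per-base depth vector over a candidate genome (e.g., from an alignment pileup); the sketch below follows the definitions above, computing regional depth over covered positions only, and uses hypothetical toy arrays.

```python
import numpy as np

def coverage_metrics(per_base_depth):
    """Genome coverage: fraction of positions covered by at least one read.
    Regional sequencing depth: total sequenced bases divided by the number
    of covered positions (i.e., mean depth over covered regions only)."""
    depth = np.asarray(per_base_depth)
    covered = depth > 0
    genome_coverage = covered.mean()
    regional_depth = depth.sum() / covered.sum() if covered.any() else 0.0
    return genome_coverage, regional_depth

# Hypothetical examples: a true organism (reads spread thinly but evenly)
# vs. a fragment-like background signal (reads piled onto a small region)
uniform = np.array([1, 0, 1, 1, 2, 1, 0, 1, 1, 1])
piled_up = np.array([0, 0, 0, 0, 25, 30, 0, 0, 0, 0])
print(coverage_metrics(uniform))    # high coverage, low regional depth
print(coverage_metrics(piled_up))   # low coverage, high regional depth
```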
Taking all these factors into consideration, we built a Precision filter that identified potential FP results through machine learning and applied it to the data from site C14, the site with the highest number of FP results. By integrating RPM, the RPM ratio (sample:control), and genome coverage, this filter reduced FP results from 34 to 11 (Fig. 6D). Importantly, applying the filter did not compromise Recall, suggesting a potential strategy for improving the Precision of pathogen metagenomics.
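As one possible realization of such a filter (the specific classifier used here is an assumption, not the study's method), a logistic regression over log-transformed RPM, sample:control RPM ratio, and genome coverage could flag likely FP detections; the training rows and candidate below are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: one row per detection with
# [RPM, RPM ratio (sample:control), genome coverage]; label 1 = true positive.
X = np.array([
    [120.0, 30.0, 0.45],   # TP-like: abundant, enriched over control, well covered
    [80.0,  15.0, 0.30],
    [5.0,    1.1, 0.02],   # FP-like: near control level, tiny genome fraction covered
    [2.0,    0.8, 0.01],
    [60.0,  10.0, 0.20],
    [1.0,    1.0, 0.005],
])
y = np.array([1, 1, 0, 0, 1, 0])

clf = LogisticRegression().fit(np.log1p(X), y)

# Apply the filter to a new detection: keep it only if predicted to be a TP
candidate = np.log1p(np.array([[4.0, 1.2, 0.03]]))
print("keep" if clf.predict(candidate)[0] == 1 else "flag as likely FP")
```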