An Advanced Proteomics Platform Ensures High Quantitative Precision in Large-Scale DIA MS Data
To address unwanted variation arising from searching DIA data against multiple spectral libraries, we developed STAVER, an algorithm that uses standard datasets to improve protein quantification accuracy and the detection of biological differences in data-independent acquisition (DIA) proteomics. STAVER consists of two main modules: the high-confidence peptide identification (High-CI-Peptides) module and the protein quantification (Peptide-to-Protein inference) module (Fig. 1A-1B). The peptide identification module analyzes n standard mass spectrometry runs, obtained by applying DIA-MS to n samples. Each run is searched against N spectral libraries generated from different sources or instruments. This process produces an n × N peptide matrix, which contains the peptide scores for each run and each spectral library. The horizontal axis of the matrix represents the peptide data from each standard mass spectrometry run, and the vertical axis represents the corresponding spectral libraries. The module then applies a statistical method to the distribution and variation of peptides within the n × N matrix to identify high-confidence peptides and their corresponding spectral libraries. To assess peptide confidence rigorously, we implemented a two-part strategy that integrates peptide frequency with peptide variation across a diverse range of samples. Each peptide is assigned a normalized score between 0 and 1, with higher scores indicating greater confidence in the peptide's reliability. The identified high-confidence peptides and their corresponding spectral libraries are then used as inputs for the protein quantification module, which performs protein inference and normalization to obtain accurate protein abundance estimates for each sample.
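The text does not give the exact scoring formula, so the following is only a minimal sketch of how a frequency term and a variation term could be combined into a score between 0 and 1. The function name `peptide_confidence`, the stability term 1 / (1 + CV), and the multiplication of the two terms are all assumptions made for illustration, not the STAVER implementation.

```python
from statistics import mean, pstdev


def peptide_confidence(intensities):
    """Hypothetical confidence score in [0, 1] for one peptide.

    `intensities` holds one value per (run, library) cell of the n x N
    matrix; None marks a cell where the peptide was not identified.
    """
    observed = [x for x in intensities if x is not None]
    if not observed:
        return 0.0
    # Frequency term: share of matrix cells in which the peptide was seen.
    frequency = len(observed) / len(intensities)
    # Variation term: coefficient of variation of the observed intensities.
    mu = mean(observed)
    cv = pstdev(observed) / mu if mu > 0 else float("inf")
    # Assumed mapping: 1 when CV = 0, approaching 0 as CV grows.
    stability = 1.0 / (1.0 + cv)
    return frequency * stability


# Toy values, invented for illustration only.
scores = {
    "PEPTIDE_A": peptide_confidence([100.0, 100.0, 100.0, 100.0]),
    "PEPTIDE_B": peptide_confidence([100.0, None, 300.0, None]),
}
```

In this toy case the peptide seen in every cell with constant intensity scores 1, while the peptide seen in half of the cells with a CV of 0.5 scores one third.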
The protein quantification module infers proteins for each sample by selecting the top N peptides from the high-confidence spectral library and weighting them with coefficients that account for peptide-level differences (Fig. 1B). The coefficients are calculated from the variation of each peptide across samples and spectral libraries. The weighted peptides are combined to obtain the protein abundance for each sample, and the abundance estimates are then normalized using a robust regression method to remove unwanted variation caused by technical factors such as instrument batch effects or library quality. In summary, STAVER is a novel DIA algorithm framework for label-free protein quantification based on standardized datasets; it can handle noise in large-scale mass spectrometry data and yield protein data with high quantitative precision. STAVER improves the accuracy and reproducibility of protein quantification in large-scale DIA-MS studies, enabling the discovery of more biological differences among samples.
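As a rough illustration of the top-N weighting idea, the sketch below ranks the peptides of one protein, keeps the top N, and combines them with variation-based weights. The ranking criterion (mean intensity), the weight 1 / (1 + CV), and the name `protein_abundance` are assumptions; the text does not specify them, and the robust-regression normalization step is not shown.

```python
from statistics import mean, pstdev


def cv(values):
    """Coefficient of variation (population standard deviation / mean)."""
    mu = mean(values)
    return pstdev(values) / mu if mu > 0 else float("inf")


def protein_abundance(peptide_runs, top_n=3):
    """Weighted top-N estimate for one protein.

    `peptide_runs` maps peptide -> list of intensities, one per sample.
    Returns one abundance estimate per sample.
    """
    # Assumed ranking: keep the top_n peptides by mean intensity.
    ranked = sorted(peptide_runs, key=lambda p: mean(peptide_runs[p]),
                    reverse=True)[:top_n]
    # Assumed weights: stable peptides (low CV) count more.
    weights = {p: 1.0 / (1.0 + cv(peptide_runs[p])) for p in ranked}
    total = sum(weights.values())
    n_samples = len(next(iter(peptide_runs.values())))
    if total == 0:
        return [0.0] * n_samples
    return [
        sum(weights[p] * peptide_runs[p][i] for p in ranked) / total
        for i in range(n_samples)
    ]


# Toy values, invented for illustration only.
toy_protein = {"P1": [10.0, 10.0], "P2": [20.0, 20.0], "P3": [1.0, 3.0]}
estimate = protein_abundance(toy_protein, top_n=2)
```

Here the two most intense peptides are both perfectly stable, so they receive equal weight and the estimate is their plain average in each sample.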
To demonstrate the advantages of STAVER in terms of accuracy, sensitivity, specificity, reproducibility, and biological relevance of protein quantification, we constructed a comprehensive and highly heterogeneous library from a large and diverse collection of human samples. The library was constructed from proteomic experiments on 327 different samples, including 46 plasma samples, 16 serum samples, 17 cerebrospinal fluid (CSF) samples, 17 urine samples, 31 ascites samples, and 12 bile samples, among others (Supplementary Fig. 1A), as well as 151 cancer tissue samples such as glioma, esophageal carcinoma, urothelial carcinoma, and clear cell renal cell carcinoma [13] (Supplementary Fig. 1B; Supplementary Table 1). These samples were subjected to label-free protein mass spectrometry analysis using the data-dependent acquisition (DDA) method on Q Exactive HF-X and Orbitrap Fusion Lumos mass spectrometers. We then used FragPipe software [14] to process the raw DDA data and generate a mixed spectral library containing 15,612 proteins and 215,529 peptides. To obtain standardized datasets for DIA analysis, we prepared standard samples from HEK 293T cells and injected them into the same instrument 90 times (reference standard dataset) and 20 times (validation dataset). We analyzed the DIA data from these datasets using software tools such as DIA-NN [8] or OpenSWATH [12] and subsequently applied STAVER to remove unwanted variation and noise that lacked biological relevance. Finally, we performed biological function analyses on the protein abundance estimates obtained by STAVER and screened for potential protein biomarkers that could distinguish different sample types or disease states.
High Performance of STAVER Algorithm in Removing Unwanted Noise in Proteome Datasets
To assess the ability of the STAVER algorithm to capture and remove noise in large-scale mass spectrometry data, we performed a technical repeat experiment using 20 consecutive injections of standard samples from 293T cells. We applied the STAVER algorithm to the DIA data obtained from these injections and compared the variability and conservation of peptides between the raw and the STAVER-processed data. We calculated the coefficient of variation (CV) of each peptide across the 20 injections and classified the peptides into three categories: highly variable (CV > 0.5), moderately variable (0.3 < CV < 0.5), and highly conservative (CV < 0.3). The STAVER algorithm significantly decreased the proportion of highly variable peptides from 36% (13,394) to 2% (1,504) and increased the proportion of highly conservative peptides from 35% to 94% (Supplementary Table 1), thereby retaining only stable and reliable peptides for protein quantification (Fig. 2A, B). Moreover, the number of conservative peptides increased from 13,240 to 22,140 after applying the STAVER algorithm, indicating that it effectively eliminated abnormal noise in the data (Fig. 2C-2D; Supplementary Table 1). To assess the impact of the STAVER algorithm on protein-level consistency, we compared protein abundance estimates derived from raw and STAVER-processed data using different metrics. For the 20 technical repeat injections, we calculated Spearman correlation coefficients among samples to evaluate reproducibility and the CV of each peptide and protein across samples to evaluate variability. The STAVER algorithm consistently improved both metrics across all datasets (Fig. 2E). 
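The three CV categories above can be sketched as follows. The use of the sample standard deviation and the assignment of the boundary values 0.3 and 0.5 to the middle category are assumptions, since the text leaves both unspecified; the intensities are toy values.

```python
from collections import Counter
from statistics import mean, stdev


def cv_category(values):
    """Classify one peptide by its CV across repeat injections."""
    mu = mean(values)
    # Assumption: sample standard deviation (n - 1 denominator).
    cv = stdev(values) / mu if mu > 0 else float("inf")
    if cv > 0.5:
        return "highly variable"
    if cv >= 0.3:  # assumption: boundary values fall in the middle class
        return "moderately variable"
    return "highly conservative"


# Toy intensities across four injections, invented for illustration only.
peptides = {
    "stable": [100.0, 102.0, 98.0, 101.0],
    "moderate": [100.0, 60.0, 140.0, 100.0],
    "noisy": [10.0, 200.0, 50.0, 400.0],
}
categories = {name: cv_category(vals) for name, vals in peptides.items()}
category_counts = Counter(categories.values())
```

Dividing each entry of `category_counts` by the number of peptides gives proportions of the kind reported for the raw and processed data.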
The median Spearman correlation coefficient between technical repeat samples increased from 0.84 (IQR = 0.83–0.85) in raw data to 0.93 (IQR = 0.93–0.94) in STAVER-processed data (Fig. 2E; Supplementary Fig. 1C), indicating higher consistency of protein abundance among samples after applying the STAVER algorithm (Supplementary Table 1). The median CV of protein data decreased from 0.623 in raw data to 0.256 in STAVER-processed data, indicating lower technical variability among samples. These results demonstrated that the STAVER algorithm can effectively enhance protein-level consistency by minimizing noise and variability in large-scale mass spectrometry data.
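The median pairwise Spearman correlation used as the reproducibility metric can be sketched in plain Python as below. The helper names are illustrative, ties receive average ranks, and the sketch assumes that no sample vector is constant; the abundance values are toy numbers.

```python
from itertools import combinations
from statistics import mean, median


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def pearson(x, y):
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def median_pairwise_spearman(samples):
    """Median Spearman correlation over all pairs of samples."""
    return median(spearman(a, b) for a, b in combinations(samples, 2))


# Three toy samples (protein abundances in the same protein order).
toy_samples = [[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0], [1.0, 3.0, 2.0, 4.0]]
median_rho = median_pairwise_spearman(toy_samples)
```

For the toy samples the three pairwise correlations are 1.0, 0.8, and 0.8, so the median is 0.8.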
A potential concern when applying the STAVER algorithm to mass spectrometry data is whether removing noisy peptides would compromise the number and quality of protein identifications. To validate the impact of the STAVER algorithm on proteomic coverage, we analyzed the protein identification results obtained from raw and processed data using different metrics. We first compared the dynamic range of protein abundance estimates between raw and processed data after log10 transformation. The dynamic range of proteins after applying the STAVER algorithm was consistent with that before applying it across all datasets (Fig. 2H), suggesting that the algorithm did not affect the coverage depth or the sensitivity of protein detection. We then compared the number of identified proteins between raw and STAVER-processed data using a stringent false discovery rate (FDR) threshold of 1% to ensure high confidence in protein identification. The number of proteins remained consistent before and after applying the STAVER algorithm. The median number of identified proteins did not differ significantly between raw and STAVER-processed data, with values of 1,796 (IQR = 1769–1815) and 1,765 (IQR = 1738–1798), respectively (Fig. 2G; Student's t-test, P-value = 0.11). This observation was further corroborated using the publicly available PXD018874 dataset, where again no substantial differences were noted between the raw data and the data processed by STAVER (Supplementary Fig. 2D; Student's t-test, P-value = 0.12). These results underscored that the algorithm did not introduce any significant loss or bias in protein identification. Finally, we assessed the quality of protein identifications by comparing sequence coverage and spectral counts. 
Sequence coverage reflects the proportion of amino acids in a protein sequence covered by identified peptides, while spectral counts reflect the number of spectra assigned to a protein or its peptides. Both metrics are commonly used to assess the confidence and reliability of protein identification and quantification in mass spectrometry studies. We calculated these metrics for each identified protein in each dataset and compared them between raw and processed data (Supplementary Table 1). To evaluate the proteins, we combined the metrics and assigned a normalized score, ranging from 0 to 1, to each peptide. The proteins were then ranked according to the scores of their corresponding peptides, allowing a comprehensive comparison of their relative importance. The STAVER algorithm improved these metrics for most proteins in all datasets (Supplementary Fig. 1D-1E; Supplementary Table 1), demonstrating that the algorithm did not compromise the quality or accuracy of protein identifications. In summary, these results showed that the STAVER algorithm can effectively reduce unwanted variability and enhance data quality at the protein level without compromising the number or quality of protein identifications. These findings highlight the potential utility of the STAVER algorithm in large-scale mass spectrometry studies.
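Sequence coverage as defined above can be computed with a simple position mask. This is a minimal sketch that assumes exact substring matching of peptides against the protein sequence; the sequence and peptides are toy strings, not taken from the datasets.

```python
def sequence_coverage(protein_seq, peptides):
    """Fraction of residues covered by at least one identified peptide."""
    if not protein_seq:
        return 0.0
    covered = [False] * len(protein_seq)
    for pep in peptides:
        if not pep:
            continue
        # Mark every occurrence of the peptide, including overlapping ones.
        start = protein_seq.find(pep)
        while start != -1:
            for i in range(start, start + len(pep)):
                covered[i] = True
            start = protein_seq.find(pep, start + 1)
    return sum(covered) / len(protein_seq)


# Toy protein of 16 residues with three identified peptides.
toy_sequence = "MKTAYIAKQRQISFVK"
toy_peptides = ["MKTAY", "TAYIA", "QISFVK"]
coverage = sequence_coverage(toy_sequence, toy_peptides)
```

The first two peptides overlap and cover residues 1 to 7, the third covers residues 11 to 16, so 13 of 16 residues are covered.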
Robust generalization performance of the STAVER algorithm across diverse platform output datasets
To evaluate the generalization performance of the STAVER algorithm on data generated from different platforms, we performed a quality control (QC) experiment using eight QC samples. These samples were created by mixing plasma samples from multiple sources, thereby providing a means for inter-sample quality control evaluation. To simulate batch effects, these samples were inserted into the mass spectrometry analysis of the formal experimental samples at different time points (every three days). The mass spectrometry analysis was conducted on the Orbitrap Fusion Lumos instrument, which is a different platform from the one used to generate the training data for the STAVER algorithm. We applied the STAVER algorithm to these eight datasets and conducted Spearman correlation analysis to evaluate the correlation coefficients among the raw protein abundances and among the STAVER-calibrated protein abundances, separately. The correlation coefficients among samples were significantly higher for STAVER-corrected protein abundances than for raw protein abundances (Wilcoxon p-value < 0.001) (Fig. 3A). Specifically, the median protein abundance correlation between samples was 0.92 (interquartile range [IQR] = 0.90–0.93) after STAVER processing, whereas that of the raw data was 0.75 (IQR = 0.74–0.77), indicating that the STAVER algorithm significantly improved the reproducibility of DIA protein data (Fig. 3B; Supplementary Table 2). This observation was further confirmed using the publicly available PXD018874 dataset, which showed better reproducibility for STAVER-processed data [0.94 (IQR = 0.93–0.96)] than for raw data [0.90 (IQR = 0.89–0.92)] (Supplementary Fig. 2A-2C).
Likewise, compared with raw protein data, STAVER-calibrated protein abundances showed significantly lower coefficients of variation (CVs) among samples. The average CVs of raw and processed protein data were 0.71 and 0.34, respectively (Fig. 3C; Supplementary Table 2). We further investigated whether STAVER affected the quantity and quality of protein identification across platforms by comparing the number of proteins with moderate (0.3 < CV < 0.5) and low (CV < 0.3) variability among the eight QC samples. We defined proteins with lower variability (CV < 0.5) as consistent proteins and assumed that they would be more reliably detected and quantified in different samples. The proportion of consistent proteins increased from 37.4% in the raw data to 73.9% (1,836 proteins) after applying STAVER (Fig. 3E; Supplementary Table 2), indicating that STAVER significantly increased the number of consistent proteins among biological repeats. In addition, only 20.5% (462) of the proteins in the raw data had CVs below 0.3, compared with 48.2% (1,197) of the proteins in the STAVER-processed data, indicating high reproducibility and robustness to batch effects and platform differences (Fig. 3E). These results suggested that STAVER generalizes well to DIA mass spectrometry data generated by other platforms and can effectively eliminate unwanted noise from DIA data, thus improving the accuracy and reproducibility of the data.
We further applied the STAVER algorithm to these eight DIA datasets to assess this improvement. Importantly, the application of the STAVER algorithm did not affect the identification depth of proteins (Fig. 3D), as the dynamic range of proteins in the raw data remained consistent with that in the STAVER-processed data. Moreover, the dynamic range of proteins spanned seven orders of magnitude, indicating that this improvement did not compromise the quality of protein identification. To assess the impact of the STAVER algorithm on the quantification of specific plasma proteins, we selected three representative plasma proteins with abundances spanning six orders of magnitude: C1R, C8G, and F5. The abundances of these proteins in the raw data exhibited large variability among QC repeats, with the largest differences exceeding ten-fold. After processing by the STAVER algorithm, the abundances of C1R, C8G, and F5 showed no variation across the eight datasets, resulting in highly reproducible protein quantification (Fig. 3G; Supplementary Table 2). We also compared conditions using the proportion of proteins with CVs less than 20%, as this is a standard threshold for in vitro diagnostic assays [15]. Notably, whereas only 12% of proteins in the raw data showed CVs less than 20%, 30% of quantified proteins fell within this threshold in the data processed by the STAVER algorithm (Fig. 3F; Supplementary Table 2). These results illustrated that the STAVER algorithm could effectively calibrate systematic errors caused by different LC-MS/MS conditions and improve the accuracy and precision of protein quantification in DIA datasets.
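The quantities discussed in this paragraph (dynamic range in orders of magnitude, fold difference among repeats, and the share of proteins under a CV threshold) can be sketched as below. The function names and the toy numbers are illustrative assumptions; the sketch assumes strictly positive abundances and uses the sample standard deviation for the CV.

```python
import math
from statistics import mean, stdev


def dynamic_range_orders(abundances):
    """Orders of magnitude between the smallest and largest positive value."""
    positive = [a for a in abundances if a > 0]
    return math.log10(max(positive) / min(positive))


def max_fold_difference(repeats):
    """Largest ratio between any two repeat measurements of one protein."""
    return max(repeats) / min(repeats)


def fraction_below_cv(protein_repeats, threshold=0.20):
    """Share of proteins whose CV across repeats is below `threshold`."""
    cvs = [stdev(v) / mean(v) for v in protein_repeats.values()]
    return sum(c < threshold for c in cvs) / len(cvs)


# Toy QC repeats, invented for illustration only.
qc = {
    "PROT_A": [100.0, 105.0, 95.0],
    "PROT_B": [1.0e3, 1.2e4, 5.0e3],
}
fold_b = max_fold_difference(qc["PROT_B"])
share = fraction_below_cv(qc)
orders = dynamic_range_orders([1.0, 1.0e3, 1.0e7])
```

In the toy data PROT_B varies twelve-fold among repeats and fails the 20% CV threshold, while PROT_A passes it.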
In summary, our study demonstrated the robust generalization performance of the STAVER algorithm across diverse platform output datasets. The algorithm effectively reduced unwanted noise information and enhanced data accuracy and reproducibility, without compromising the quality or depth of protein identification. Additionally, the STAVER algorithm significantly improved the consistency of protein quantification in DIA datasets, exhibiting high reproducibility and robustness to batch effects and platform differences. These findings highlight the potential utility of the STAVER algorithm in large-scale mass spectrometry studies, particularly in the context of DIA mass spectrometry data generated by various platforms.
STAVER algorithm harmonizes heterogeneous results from multicenter disease studies
The increasing availability of DIA proteomic data requires DIA datasets to be harmonized across platforms so that results are consistent and comparable and reflect biological differences rather than technical variation [16,17]. This harmonization is critical for understanding complex biological systems and processes, especially in emerging diseases such as COVID-19 [18]. To determine the ability of the STAVER algorithm to unify biologically relevant differences and findings across datasets from different platforms, we applied it to four previously published COVID-19 plasma DIA proteomics datasets generated by different laboratories and batches: IPX0002186001 [13], IPX0002924001 [19], PXD025752 [20], and PXD018874 [21]. These datasets were downloaded from public repositories and reprocessed using the same search software and parameters to eliminate potential errors caused by inconsistent data processing (Materials and Methods). We then performed principal component analysis (PCA) to assess the separation between different disease stages based on protein abundance profiles. We plotted the first two principal components (PC1 and PC2) of each dataset and compared them between raw and processed data. Compared with the raw DIA proteomic data, the STAVER-processed data showed improved separation between different disease stages (Fig. 4A; Supplementary Fig. 2G), indicating that the algorithm could improve the biological signal by removing unwanted noise from non-biological variation, thus enabling more accurate capture of biological differences between samples.
Moreover, the STAVER algorithm efficiently reduced the CVs among samples for each protein across all datasets. The CV measures the dispersion of a distribution relative to its mean, so lower CVs indicate higher reproducibility and reliability of protein DIA data. Our results therefore demonstrated that the STAVER algorithm could increase the reproducibility and reliability of protein DIA data across different laboratories and batches (Fig. 4B). We further investigated the biological performance of the STAVER algorithm for COVID-19 by performing differential protein expression analysis across disease stages for the four datasets. Using the limma package [22], a well-established tool for differential expression analysis of omics data, we performed differential protein analysis among groups on both the raw DIA proteomic data and the STAVER-processed data. In the raw datasets, the density distributions of significance were more heterogeneous and exhibited features such as bimodal patterns. After applying the STAVER algorithm, the density distributions of significance were more consistent across experiments, and the abnormal peaks were eliminated (Fig. 4C; Supplementary Table 3). These results indicated that the STAVER algorithm can effectively reduce non-biological variation caused by batch effects or instrument drift and enhance the accuracy and reproducibility of protein quantification. 
To further assess the biological relevance of the STAVER algorithm for protein quantification, we compared the significantly differential proteins between COVID-19 plasma and the healthy population in two independent cohorts (IPX0002186001 and IPX0002924001). In the raw data, a mere 10.9% of the proteins exhibiting significant differences were shared between the two cohorts. In contrast, after processing the data with the STAVER algorithm, we observed a substantial increase in the differentially expressed proteins (DEPs; proteins with significant differences as determined by the Wilcoxon rank-sum test, p-value < 0.05) detected in both cohorts. The proportion of shared DEPs increased to 27.3%, highlighting the algorithm's effectiveness in enhancing data quality and revealing more meaningful relationships between the cohorts (Fig. 4D). This suggested that the STAVER algorithm could improve the consistency and comparability of results across different platforms by minimizing technical variation and highlighting biological differences. We then compared the DEPs between different disease stages in three cohorts (IPX0002186001, PXD025752, and PXD018874). In the STAVER-processed data, 6.2% (59) of the proteins with significant differences were consistent in all three cohorts. In comparison, only 0.7% (20) of such proteins were shared among the three cohorts in the raw data (Fig. 4E). This implied that the STAVER algorithm could increase the convergence of differential proteins by eliminating unwanted noise from non-biological variation across different platforms.
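The shared-DEP proportion can be illustrated with set operations. The text does not state which denominator was used, so taking the union of all cohort DEP sets is an assumption here, and the protein labels are placeholders rather than real results.

```python
def shared_fraction(*dep_sets):
    """Proteins significant in every cohort, as a share of the union.

    Assumption: the denominator is the union of all cohorts' DEP sets.
    Expects at least one DEP set.
    """
    sets = [set(s) for s in dep_sets]
    union = set().union(*sets)
    shared = set.intersection(*sets)
    return len(shared) / len(union) if union else 0.0


# Placeholder DEP lists for two cohorts, invented for illustration only.
cohort_a = {"P1", "P2", "P3", "P4"}
cohort_b = {"P2", "P3", "P5"}
overlap = shared_fraction(cohort_a, cohort_b)
```

With two shared proteins out of five in the union, the toy overlap is 0.4; passing three sets gives the three-cohort version of the same comparison.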
Identifying plasma protein biomarkers for patients with COVID-19 is crucial for diagnosing, evaluating prognosis, and treating this disease [23,24]. Previous studies have clinically validated four plasma proteins (albumin [ALB], selenoprotein P [SELENOP], antithrombin III [SERPINC1] and platelet factor 4 [PF4]) as biomarkers for COVID-19 severity and mortality (Supplementary Table 3). These proteins are generally downregulated in COVID-19 patients compared to healthy controls, reflecting the systemic inflammation and coagulation dysfunction caused by the viral infection. However, the discovery of COVID-19 plasma protein biomarkers is hampered by the heterogeneity and inconsistency of data acquisition platforms across different cohorts and laboratories. For instance, the raw quantitative proteomic data from two independent cohorts (IPX0002924001 and IPX0002186001) that used different data-independent acquisition (DIA) mass spectrometry platforms showed contradictory results for these four proteins, with some of them being upregulated in one cohort but not the other (Fig. 4F; Supplementary Table 3). This inconsistency undermines the biological validity and reproducibility of the findings and limits their clinical utility. We therefore applied the STAVER algorithm to the raw data from the two cohorts to standardize protein quantification across platforms and to filter out unreliable or spurious signals. After processing, all four proteins showed a consistent trend of downregulation and significant biological differences in both cohorts, in agreement with the clinical criteria (Fig. 4F).
Similarly, two other COVID-19 plasma protein markers, transforming growth factor beta-induced protein (TGFBI) [25] and amine oxidase copper-containing 3 (AOC3) [23], which are usually upregulated in patients due to their roles in tissue remodeling and vascular inflammation, also showed inconsistent results in the raw quantitative proteomic data from the two cohorts (Fig. 4F; Supplementary Table 3). The STAVER algorithm corrected this inconsistency and revealed that both proteins were upregulated and had significant biological differences in both cohorts. These results demonstrated the power of the STAVER algorithm to unify DIA data from different platforms and improve the biological comparability and reproducibility of COVID-19 biomarker discovery.
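The direction-of-change check described for these biomarkers can be sketched as a comparison of log2 fold changes across cohorts. The group means, the function names, and the toy abundances are assumptions for illustration; the significance testing reported in the text is not shown.

```python
import math
from statistics import mean


def log2_fold_change(case, control):
    """log2 ratio of mean abundance in patients versus controls."""
    return math.log2(mean(case) / mean(control))


def direction(lfc):
    if lfc > 0:
        return "up"
    if lfc < 0:
        return "down"
    return "flat"


def concordant(cohorts):
    """True when a protein changes in the same direction in every cohort."""
    directions = {
        direction(log2_fold_change(c["case"], c["control"])) for c in cohorts
    }
    return len(directions) == 1


# Toy abundances of one protein in three cohorts, invented for illustration.
cohort_1 = {"case": [50.0, 60.0], "control": [100.0, 120.0]}
cohort_2 = {"case": [40.0, 40.0], "control": [80.0, 80.0]}
cohort_3 = {"case": [200.0, 200.0], "control": [100.0, 100.0]}
agree = concordant([cohort_1, cohort_2])
disagree = concordant([cohort_1, cohort_3])
```

The first two toy cohorts both show a halving of abundance in patients and therefore agree, while the third shows a doubling and breaks the agreement.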
In conclusion, the STAVER algorithm demonstrated its ability to standardize and unify DIA proteomic data from different platforms, improving the biological signal by removing unwanted noise from non-biological variation. The algorithm also increased the reproducibility and reliability of protein quantification, thereby enhancing the accuracy of differential protein analysis. Furthermore, applying the STAVER algorithm to four published COVID-19 plasma DIA proteomics datasets enabled the discovery of consistent and reliable plasma protein biomarkers across different cohorts and laboratories, demonstrating its potential clinical utility. In summary, the STAVER algorithm is a powerful tool for improving the comparability and reproducibility of DIA proteomic data, facilitating the understanding of complex biological systems and processes, and accelerating the discovery of disease biomarkers.