STAVER: A Standardized Benchmark Dataset-Based Algorithm for Effective Variation Reduction in Large-Scale DIA MS Data

doi:10.21203/rs.3.rs-3111384/v1

Article

https://doi.org/10.21203/rs.3.rs-3111384/v1

This work is licensed under a CC BY 4.0 License

Version 1

Mass spectrometry-based proteomics has emerged as a powerful tool for the comprehensive investigation of complex biological systems. Data-independent acquisition (DIA) mass spectrometry enables the simultaneous quantification of thousands of proteins, with multi-spectral library search strategies showing great promise for enhancing protein identification and quantification. However, the presence of poor-quality profiles can considerably impact the accuracy of quantitative results, leading to erroneous protein quantification. To address this challenge, we developed STAVER, a standardized benchmark dataset-based algorithm that efficiently reduces variation in large-scale DIA MS data. By using the benchmark dataset to standardize mass spectrometry signals, STAVER effectively removes unwanted noise and enhances protein quantification accuracy, especially in the context of multi-spectral library searching. We validated the effectiveness of STAVER in several large-scale DIA datasets, demonstrating improved identification and quantification of thousands of proteins. STAVER represents an innovative and efficacious approach for removing unwanted noise in large-scale DIA proteome data. It enables cross-study comparison and integration of DIA datasets across different platforms and laboratories, enhancing the consistency and reproducibility of clinical research findings. The complete package is accessible online at https://github.com/Ran485/STAVER.

Biological sciences/Biotechnology/Proteomics

Biological sciences/Biological techniques/Bioinformatics

Data-independent acquisition (DIA) is a mass spectrometry-based method that has emerged as a promising technology in proteomics research because it detects and quantifies proteins in complex biological samples in an unbiased and comprehensive manner. The DIA method evolved from the traditional data-dependent acquisition (DDA) technique¹, which selects only the most intense peaks in a mass spectrometry scan for analysis and is therefore limited in reproducibility and proteome coverage. In contrast, the DIA method fragments all precursor ions within a specified mass range to generate a comprehensive library of fragment ions, which is used to identify and quantify almost all proteins in a sample, improving proteome coverage depth and providing a more comprehensive analysis²,³. DIA technology has developed rapidly, leading to its increasingly large-scale application in essential clinical research⁴. DIA has several advantages over DDA, such as high sensitivity and reproducibility⁵. However, DIA also faces various challenges that affect protein identification and quantification accuracy, especially in large-scale studies. These challenges include sample preparation, instrument variability, high inter-run variation, and data processing⁵,⁶. Addressing them is therefore essential for the ongoing development of DIA technology.

Proteome analysis is a complex task that requires high sensitivity, specificity, and throughput. DIA technology is a mass spectrometry method that aims to meet these requirements by fragmenting and analyzing all ions within a selected m/z range⁷. However, DIA data analysis poses several difficulties, such as interference correction, peak detection and alignment, spectral library generation, and peptide and protein identification and quantification⁸. To address these difficulties, various DIA approaches have been developed over the years with different features and advantages. One of the first was SWATH-MS (sequential window acquisition of all theoretical fragment ion spectra mass spectrometry), which divided the mass-to-charge range into multiple windows and acquired all fragment ions within each window⁷. SWATH-MS was highly reproducible and accurate, but it required a comprehensive spectral library and had a limited dynamic range for high-abundance peptides⁹.

Other DIA approaches include SONAR (sequential window acquisition of all theoretical spectra with ion mobility separation), which used ion mobility separation to reduce signal interferences and increase sensitivity¹⁰; DIA-Umpire (a universal method for data-independent acquisition using computational deconvolution), which did not require a spectral library but relied on computational deconvolution to identify peptides from complex mixtures¹¹; Skyline (a targeted proteomics environment), which used a comprehensive spectral library for targeted extraction of peptide ions and corresponding fragment ions⁹; OpenSWATH (an open-source software tool for SWATH-MS data analysis), which combined the best features of the SWATH-MS and SONAR methods, utilized a machine learning-based targeted approach for peptide and fragment ion extraction, and employed advanced algorithms for peak detection and alignment¹²; and DIA-NN (data-independent acquisition by neural networks), which applied neural networks for interference correction, peak detection, alignment, spectral library generation, and peptide identification and quantification⁸. DIA-NN was capable of handling samples with diverse peptide and protein compositions with high accuracy and sensitivity, making it a valuable tool for analyzing low-abundance molecules in complex mixtures⁸. Overall, the evolution of DIA analysis techniques has been driven by the need for improved sensitivity, specificity, and throughput, and by advancements in computational technologies.

The development of DIA technology has undergone three stages of advancement. In the first stage, the samples and the protein spectral libraries were analyzed in one batch, resulting in accurate but limited protein identification and complex, time-consuming data analysis. In the second stage, multiple protein spectral libraries were used for the same samples, increasing the number of protein identifications but decreasing their accuracy. The data complexity also increased further, as did the time cost of data analysis. In the third stage, the protein spectral libraries and the samples could be separated, and multiple protein spectral libraries could be applied to different samples from diverse sources or instruments. This made the operation more convenient and significantly improved the number and efficiency of protein identifications; for example, using the DIA-NN method increased the number of identified proteins per sample⁸. DIA proteomics thus allows simultaneous identification and quantification of proteins from multiple spectral libraries, which could increase protein coverage and capture more biological differences. This multi-spectral library searching approach has become increasingly common in proteomics.

However, this multi-spectral library searching approach also introduces unwanted noise and variation in the data, compromising protein quantification accuracy. This challenge is exacerbated in large-scale DIA-MS studies that involve multiple protein spectral libraries generated from different sources or instruments, which can suffer from high coefficients of variation (CVs) across samples, decreasing the accuracy of protein identification and quantification. To address this challenge, we developed STAVER, a novel algorithm that efficiently removes unwanted variation from large-scale DIA MS data based on reference standardized datasets. We applied STAVER to several large-scale DIA-MS datasets and demonstrated its ability to improve protein quantification accuracy and to reflect biological differences. STAVER separates the protein spectral library from the samples, allowing multiple spectral libraries to be applied to DIA mass spectrometry data from different platforms and laboratories, which improves the convenience and efficiency of the operation. Moreover, the modular design of STAVER ensures seamless integration with existing DIA-MS data analysis pipelines and enables its extension and enhancement for new proteomic analysis tasks. Our findings indicate that STAVER has the potential to become a valuable tool for quantitative proteomics studies.

In conclusion, our study highlights the importance of addressing variability in large-scale DIA-MS studies and presents a novel algorithm that could reduce unwanted variation, thereby improving the accuracy and reproducibility of protein quantification. Employing STAVER in DIA-based proteomics studies is anticipated to enhance our understanding of complex biological systems and facilitate the discovery of novel biomarkers, therapeutic targets, and mechanistic insights. Additionally, the 90-sample DIA benchmark dataset and newly generated 327 comprehensive protein spectral libraries offer invaluable resources for facilitating DIA proteomic analyses. It is expected that STAVER will become a crucial tool in the continually evolving field of proteomics, promoting further advancements in both technological development and biological discovery.

An Advanced Proteomics Platform Ensures High Quantitative Precision in Large-Scale DIA MS Data

To address unwanted variation arising from the DIA multiple spectral library searching approach, we developed STAVER, an algorithm that uses standard datasets to improve protein quantification accuracy and detect biological differences in DIA proteomics. STAVER consists of two main modules: the high-confidence peptide identification (High-CI-Peptides) module and the protein quantification (Peptide-to-Protein inference) module (Fig. 1A-1B). The peptide identification module analyzes a set of n standard mass spectrometry runs, obtained by applying DIA-MS to n samples. Each run is searched against N spectral libraries generated from different sources or instruments. This process produces an n × N peptide matrix, which contains the peptide scores for each run and each spectral library. The horizontal axis of the matrix represents each standard mass spectrometry peptide datum, and the vertical axis represents the corresponding number of spectral libraries. The module then applies a statistical method to the peptide distribution and variation within the n × N matrix to identify high-confidence peptides and the spectral libraries corresponding to them. To rigorously assess the confidence level of peptides, we implemented a dual-faceted strategy that integrates metrics of peptide frequency and variation across a diverse range of samples. Each peptide is assigned a normalized score ranging from 0 to 1, with higher scores indicating greater confidence in the peptide's reliability. The identified high-confidence peptides and their corresponding spectral libraries are then used as inputs for the protein quantification module, which performs protein inference and normalization to obtain accurate protein abundance estimates for each sample.
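The dual-faceted scoring described above can be illustrated with a minimal sketch. The exact formula used by STAVER is not given in this section, so the combination below (detection frequency multiplied by a CV-based stability term) is an assumption made purely for illustration, as are the names `peptide_confidence` and `cv_cap`.

```python
from statistics import mean, stdev

def peptide_confidence(intensities, cv_cap=1.0):
    """Hypothetical confidence score in [0, 1] for one peptide.

    intensities: one value per reference run; None marks a missing value.
    cv_cap: assumed CV at or above which the stability term drops to 0.
    """
    observed = [x for x in intensities if x is not None]
    if len(observed) < 2 or mean(observed) <= 0:
        return 0.0  # too few usable observations to estimate variation
    frequency = len(observed) / len(intensities)  # fraction of runs with a detection
    cv = stdev(observed) / mean(observed)         # coefficient of variation
    stability = max(0.0, 1.0 - cv / cv_cap)       # 1 = perfectly stable, 0 = CV >= cap
    return frequency * stability

# Made-up intensities: a stable peptide and a sparse, noisy one.
stable = peptide_confidence([100.0, 102.0, 98.0, 101.0])
noisy = peptide_confidence([100.0, None, 10.0, None])
print(round(stable, 3), round(noisy, 3))
```

A multiplicative combination is only one option; a weighted sum of the two terms would serve equally well as an illustration.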

The protein quantification module infers proteins for each sample by selecting the top N peptides from the high-confidence spectral library and weighting them with coefficients that account for peptide-level differences (Fig. 1B). The coefficients are calculated from the variation of each peptide across different samples and spectral libraries. The weighted peptides are then combined to obtain the protein abundance for each sample. The protein abundance estimates are then normalized using a robust regression method to remove unwanted variation caused by technical factors such as instrument batch effects or library quality. In summary, STAVER is a novel DIA algorithm framework for label-free protein quantification based on standardized datasets that handles noise in large-scale mass spectrometry data to yield protein data with high quantitative precision. STAVER improves the accuracy and reproducibility of protein quantification in large-scale DIA-MS studies, enabling the discovery of more biological differences among samples.
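A weighted top-N roll-up of this kind can be sketched as follows. The section does not state how the coefficients are computed or how the top N peptides are ranked, so ranking by intensity and the weight `1 / (1 + CV)` are both assumptions, and `protein_abundance` is a hypothetical name.

```python
def protein_abundance(peptides, top_n=3):
    """Hypothetical weighted top-N roll-up of peptide intensities to one protein.

    peptides: list of (intensity, cv) pairs for one protein in one sample,
              where cv is the peptide's CV in the reference dataset.
    """
    # Assumption: the top_n peptides are the most intense ones.
    top = sorted(peptides, key=lambda p: p[0], reverse=True)[:top_n]
    if not top:
        return 0.0
    # Assumed weighting: stable peptides (low CV) contribute more.
    weights = [1.0 / (1.0 + cv) for _, cv in top]
    weighted_sum = sum(w * intensity for w, (intensity, _) in zip(weights, top))
    return weighted_sum / sum(weights)

# Made-up peptides: the most intense one is also the least stable.
example = [(1000.0, 0.1), (800.0, 0.2), (5000.0, 2.0), (50.0, 0.05)]
abundance = protein_abundance(example)
print(round(abundance, 1))
```

With these inputs the unstable peptide is down-weighted, so the estimate falls below the plain mean of the three most intense peptides.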

To demonstrate the advantages of STAVER in terms of accuracy, sensitivity, specificity, reproducibility, and biological relevance of protein quantification, we constructed a comprehensive and highly heterogeneous library using a large and diverse collection of human samples. The library was constructed from proteomic experiments on 327 different samples, including 46 plasma samples, 16 serum samples, 17 cerebrospinal fluid (CSF) samples, 17 urine samples, 31 ascites samples, and 12 bile samples, among others (Supplementary Fig. 1A), as well as 151 cancer tissue samples such as glioma, esophageal carcinoma, urothelial carcinoma, and clear cell renal cell carcinoma¹³ (Supplementary Fig. 1B; Supplementary Table 1). These samples were subjected to label-free protein mass spectrometry analysis using the data-dependent acquisition (DDA) method on Q Exactive HF-X and Orbitrap Fusion Lumos mass spectrometers. We then used FragPipe software¹⁴ to process the raw DDA data and generate a mixed spectral library containing 15,612 proteins and 215,529 peptides. To obtain standardized datasets for DIA analysis, we prepared standard samples from HEK 293T cells and injected them into the same instrument 90 times (reference standard dataset) and 20 times (validation dataset). We analyzed the DIA data from these datasets using software tools such as DIA-NN⁸ or OpenSWATH¹² and subsequently applied STAVER to remove unwanted variation and noise that lacked biological relevance. Finally, we performed biological function analyses on the protein abundance estimates obtained by STAVER and screened for potential protein biomarkers that could distinguish different sample types or disease states.

High Performance of STAVER Algorithm in Removing Unwanted Noise in Proteome Datasets

To assess the ability of the STAVER algorithm to capture and remove noise in large-scale mass spectrometry data, we performed a technical replicate experiment using 20 consecutive injections of standard samples from 293T cells. We applied the STAVER algorithm to the DIA data obtained from these injections and compared the variability and conservation of peptides between the raw and the STAVER-processed data. We calculated the coefficient of variation (CV) of each peptide across the 20 injections and classified the peptides into three categories: highly variable (CV > 0.5), moderately variable (0.3 < CV < 0.5), and highly conservative (CV < 0.3). The STAVER algorithm significantly decreased the proportion of highly variable peptides from 36% (13,394) to 2% (1,504) and increased the proportion of highly conservative peptides from 35% to 94% (Supplementary Table 1), thereby retaining only stable and reliable peptides for protein quantification (Fig. 2A, B). Moreover, the number of conservative peptides increased from 13,240 to 22,140 after applying the STAVER algorithm, indicating that it effectively eliminated abnormal noise in the data (Fig. 2C-2D; Supplementary Table 1).

To assess the STAVER algorithm's impact on protein-level consistency, we compared protein abundance estimates derived from raw and STAVER-processed data using different metrics. Across the 20 technical replicate injections, we computed Spearman correlation coefficients among samples to evaluate reproducibility and the CV of each peptide and protein across samples to evaluate variability. The STAVER algorithm consistently improved both metrics across all datasets (Fig. 2E). The median Spearman correlation coefficient between technical replicates increased from 0.84 (IQR = 0.83–0.85) in raw data to 0.93 (IQR = 0.93–0.94) in STAVER-processed data (Fig. 2E; Supplementary Fig. 1C), indicating higher protein abundance consistency among samples (Supplementary Table 1). The median CV of protein data decreased from 0.623 in raw data to 0.256 in STAVER-processed data, indicating lower technical variability among samples. These results demonstrated that the STAVER algorithm can effectively enhance protein-level consistency by minimizing noise and variability in large-scale mass spectrometry data.
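The CV-based classification used above can be sketched in a few lines. The thresholds (0.3 and 0.5) come from the text; the function name `classify_cv`, the use of the sample standard deviation, and the treatment of values exactly on a boundary are assumptions, and the intensities are made up.

```python
from statistics import mean, stdev

def classify_cv(values):
    """Return (cv, label) for one peptide measured across replicate injections."""
    cv = stdev(values) / mean(values)  # sample SD relative to the mean
    if cv > 0.5:
        label = "highly variable"
    elif cv < 0.3:
        label = "highly conservative"
    else:
        # The text leaves CV == 0.3 and CV == 0.5 unassigned; they are
        # placed in the middle class here as an assumption.
        label = "moderately variable"
    return cv, label

for name, values in {
    "pep_stable": [100.0, 105.0, 95.0, 102.0],
    "pep_moderate": [100.0, 140.0, 60.0, 100.0],
    "pep_noisy": [100.0, 20.0, 180.0, 60.0],
}.items():
    cv, label = classify_cv(values)
    print(name, round(cv, 3), label)
```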

A potential concern when applying the STAVER algorithm to mass spectrometry data is whether removing noisy peptides compromises the number and quality of protein identifications. To evaluate the impact of the STAVER algorithm on proteomic coverage, we performed a detailed analysis of protein identification results obtained from raw and processed data using different metrics. We first compared the dynamic range of log10-transformed protein abundance estimates between raw and processed data. The dynamic range of proteins after applying the STAVER algorithm was consistent with that before across all datasets (Fig. 2H), suggesting that the algorithm did not affect the coverage depth or the sensitivity of protein detection. We then compared the number of identified proteins between raw and STAVER-processed data using a stringent false discovery rate (FDR) threshold of 1% to ensure high confidence in protein identification. The number of proteins remained consistent before and after applying the STAVER algorithm. The median number of protein identifications did not differ significantly between raw and STAVER-processed data, with values of 1,796 (IQR = 1769–1815) and 1,765 (IQR = 1738–1798), respectively (Fig. 2G; Student's t-test, P-value = 0.11). This observation was corroborated using the publicly available PXD018874 dataset, where again no substantial differences were noted between the raw data and the data processed by STAVER (Supplementary Fig. 2D; Student's t-test, P-value = 0.12). These results indicate that the algorithm did not introduce any significant loss or bias in protein identification.

Finally, we assessed the quality of protein identifications by comparing sequence coverage and spectral counts. Sequence coverage reflects the proportion of amino acids in a protein sequence covered by identified peptides, while spectral counts reflect the number of spectra assigned to a protein or its peptides. Both metrics are commonly used to assess the confidence and reliability of protein identification and quantification in mass spectrometry studies. We calculated these metrics for each identified protein in each dataset and compared them between raw and processed data (Supplementary Table 1). To evaluate the proteins, we combined the metrics and assigned a normalized score, ranging from 0 to 1, to each peptide. The proteins were then ranked according to the scores of their corresponding peptides, allowing a comprehensive comparison of their relative importance. The STAVER algorithm improved these metrics for most proteins in all datasets (Supplementary Fig. 1D-1E; Supplementary Table 1), demonstrating that the algorithm did not compromise the quality or accuracy of protein identifications. In summary, these results showed that the STAVER algorithm can effectively reduce unwanted variability and enhance data quality at the protein level without compromising the number or quality of protein identifications. These findings highlight the potential utility of the STAVER algorithm in large-scale mass spectrometry studies.
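Sequence coverage as defined above (the proportion of residues covered by at least one identified peptide) can be computed with a short sketch. The function name `sequence_coverage`, the exact-substring matching, and the toy sequence are assumptions for illustration, not the procedure used in the study.

```python
def sequence_coverage(protein_seq, peptides):
    """Fraction of residues in protein_seq covered by at least one peptide."""
    if not protein_seq:
        return 0.0
    covered = [False] * len(protein_seq)
    for pep in peptides:
        if not pep:
            continue
        start = protein_seq.find(pep)
        while start != -1:  # mark every occurrence of the peptide
            for i in range(start, start + len(pep)):
                covered[i] = True
            start = protein_seq.find(pep, start + 1)
    return sum(covered) / len(protein_seq)

# Toy 16-residue sequence with two non-overlapping 6-residue peptides.
coverage = sequence_coverage("MKTAYIAKQRQISFVK", ["TAYIAK", "QISFVK"])
print(coverage)  # 12 of 16 residues covered -> 0.75
```

Overlapping peptides are counted once per residue, so the value never exceeds 1.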

Robust Generalization Performance of the STAVER Algorithm Across Diverse Platform Output Datasets

To evaluate the generalization performance of the STAVER algorithm on data generated from different platforms, we performed a quality control (QC) experiment using eight QC samples. These samples were created by mixing plasma samples from multiple sources, thereby providing a means for inter-sample quality control evaluation. To simulate batch effects, these samples were inserted into the mass spectrometry analysis of the formal experimental samples at different time points (every three days). The mass spectrometry analysis was conducted on an Orbitrap Fusion Lumos instrument, a different platform from the one used to generate the training data for the STAVER algorithm. We applied the STAVER algorithm to these eight runs and conducted Spearman correlation analysis to evaluate the correlation coefficients among raw protein abundances and among STAVER-calibrated protein abundances separately. With STAVER-corrected protein abundances, the correlation coefficients among samples were significantly higher than with raw protein abundances (Wilcoxon p-value < 0.001) (Fig. 3A). Specifically, the median protein abundance correlation between samples after STAVER processing was 0.92 (interquartile range [IQR] = 0.90–0.93), whereas that of the raw data was 0.75 (IQR = 0.74–0.77), indicating that the STAVER algorithm significantly improved the reproducibility of DIA protein data (Fig. 3B; Supplementary Table 2). This observation was further confirmed using the publicly available PXD018874 dataset, which showed better reproducibility for STAVER-processed data [0.94 (IQR = 0.93–0.96)] than for raw data [0.90 (IQR = 0.89–0.92)] (Supplementary Fig. 2A-2C).
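The reproducibility summary reported above (median and IQR of pairwise Spearman correlations between samples) can be sketched with the standard library alone. The helper names `ranks`, `spearman`, and `pairwise_summary` are hypothetical, the abundance vectors are made up, and the quartile convention is the default of `statistics.quantiles`, which may differ from the one used in the study.

```python
from itertools import combinations
from statistics import mean, median, quantiles

def ranks(values):
    """Average ranks (1-based); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5  # undefined for constant vectors

def pairwise_summary(samples):
    """Median and (Q1, Q3) of Spearman correlations over all sample pairs."""
    rhos = [spearman(a, b) for a, b in combinations(samples, 2)]
    q1, _, q3 = quantiles(rhos, n=4)
    return median(rhos), (q1, q3)

# Made-up protein abundances for three QC runs (same protein order in each).
samples = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [1.1, 2.2, 2.9, 4.5, 5.1],
    [2.0, 1.0, 3.0, 5.0, 4.0],
]
med, (q1, q3) = pairwise_summary(samples)
print(round(med, 2), round(q1, 2), round(q3, 2))
```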

Likewise, compared with raw protein data, STAVER-calibrated protein abundances showed significantly lower coefficients of variation (CVs) among samples. The average CVs of raw and processed protein data were 0.71 and 0.34, respectively (Fig. 3C; Supplementary Table 2). We further investigated whether STAVER affected the quantity and quality of protein identification across platforms by comparing the number of proteins with moderate (0.3 < CV < 0.5) and low (CV < 0.3) variability among the eight QC samples. We defined proteins with CV < 0.5 as consistent proteins and assumed that they would be more reliably detected and quantified in different samples. The proportion of consistent proteins increased from 37.4% in the raw data to 73.9% (1,836) after applying STAVER (Fig. 3E; Supplementary Table 2), indicating that STAVER significantly increased the number of consistent proteins among QC repeats. In addition, only 20.5% (462) of the proteins in the raw data had CVs below 0.3, compared with 48.2% (1,197) of the STAVER-processed proteins, indicating high reproducibility and robustness to batch effects and platform differences (Fig. 3E). These results suggested that STAVER generalizes well to DIA mass spectrometry data generated by other platforms and can effectively eliminate unwanted noise from DIA data, thus improving the accuracy and reproducibility of the data.

We further examined these eight DIA runs to assess this improvement. Importantly, the application of the STAVER algorithm did not affect the identification depth of proteins (Fig. 3D), as the dynamic range of proteins in raw data remained consistent with that in the STAVER-processed data. Moreover, the dynamic range of proteins spanned seven orders of magnitude, indicating that this improvement did not compromise the quality of protein identification. To assess the impact of the STAVER algorithm on the quantification of specific plasma proteins, we selected three representative plasma proteins with abundances spanning six orders of magnitude: C1R, C8G, and F5. The abundances of these proteins in the raw data exhibited large variability among QC repeats, with the largest differences exceeding ten-fold. After processing by the STAVER algorithm, the abundances of C1R, C8G, and F5 showed no appreciable variation across the eight runs, resulting in highly reproducible protein quantification (Fig. 3G; Supplementary Table 2). We compared the conditions using the proportion of proteins with CVs below 20%, as this is a standard threshold for in vitro diagnostic assays¹⁵. Notably, whereas only 12% of proteins in the raw data showed CVs below 20%, 30% of quantified proteins fell within this threshold after STAVER processing (Fig. 3F; Supplementary Table 2). These results illustrated that the STAVER algorithm could effectively calibrate systematic errors caused by different LC-MS/MS conditions and improve the accuracy and precision of protein quantification in DIA datasets.
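Two of the summary statistics used above, the dynamic range in orders of magnitude and the share of proteins below a CV threshold, are simple to compute. The sketch below uses hypothetical function names and made-up numbers; only the 20% threshold is taken from the text.

```python
import math

def dynamic_range_orders(abundances):
    """Orders of magnitude spanned by the positive abundances: log10(max / min)."""
    positive = [a for a in abundances if a > 0]
    return math.log10(max(positive) / min(positive))

def fraction_below(cvs, threshold=0.2):
    """Share of proteins whose CV is strictly below the threshold."""
    return sum(cv < threshold for cv in cvs) / len(cvs)

# Made-up abundances spanning 1e2 to 1e9, and made-up per-protein CVs.
orders = dynamic_range_orders([1e2, 5e4, 3e6, 1e9])
share = fraction_below([0.10, 0.25, 0.15, 0.60, 0.19])
print(round(orders, 2), share)
```

Zero or negative abundances are dropped before taking the ratio, since they have no logarithm.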

In summary, our study demonstrated the robust generalization performance of the STAVER algorithm across diverse platform output datasets. The algorithm effectively reduced unwanted noise information and enhanced data accuracy and reproducibility, without compromising the quality or depth of protein identification. Additionally, the STAVER algorithm significantly improved the consistency of protein quantification in DIA datasets, exhibiting high reproducibility and robustness to batch effects and platform differences. These findings highlight the potential utility of the STAVER algorithm in large-scale mass spectrometry studies, particularly in the context of DIA mass spectrometry data generated by various platforms.

STAVER Algorithm Harmonizes Heterogeneous Results from Multicenter Disease Studies

The increasing availability of DIA proteomic data requires harmonizing DIA datasets across different platforms so that results are consistent and comparable and reflect biological differences rather than technical variation¹⁶,¹⁷. This harmonization is critical for understanding complex biological systems and processes, especially in emerging diseases such as COVID-19¹⁸. To determine the ability of the STAVER algorithm to unify biologically relevant differences and findings across datasets from different platforms, we applied it to four previously published COVID-19 plasma DIA proteomics datasets generated by different laboratories and batches: IPX0002186001¹³, IPX0002924001¹⁹, PXD025752²⁰ and PXD018874²¹. These datasets were downloaded from public repositories and reprocessed with the same search software and parameters to eliminate potential errors caused by inconsistent data processing (Materials and Methods). We then performed principal component analysis (PCA) to assess the separation between different disease stages based on protein abundance profiles. We plotted the first two principal components (PC1 and PC2) of each dataset and compared them between raw and processed data. Compared with the raw DIA proteomic data, the STAVER-processed data showed improved separation between different disease stages (Fig. 4A; Supplementary Fig. 2G), indicating that STAVER could strengthen the biological signal by removing unwanted noise from non-biological variation, thus enabling more accurate capture of biological differences between samples.

Moreover, the STAVER algorithm efficiently reduced the CVs among samples for each protein across all datasets (Fig. 4B). The CV measures the dispersion of a distribution relative to its mean, so lower CVs indicate higher reproducibility and reliability of protein DIA data across different laboratories and batches. We further investigated the biological performance of the STAVER algorithm for COVID-19 by performing differential protein expression analysis across disease stages for the four datasets. Using the limma package²², a well-established tool for differential expression analysis of omics data, we performed differential protein analysis among groups on both the raw DIA proteomic data and the STAVER-processed data. In the raw data, the density distributions of significance were more heterogeneous, with features such as bimodal patterns. After applying the STAVER algorithm, the density distributions of significance were more consistent across experiments, and the abnormal peaks were eliminated (Fig. 4C; Supplementary Table 3). These results indicated that the STAVER algorithm can effectively reduce non-biological variation caused by batch effects or instrument drift and enhance the accuracy and reproducibility of protein quantification.

To further assess the biological relevance of the STAVER algorithm for protein quantification, we compared the significantly differential proteins between COVID-19 plasma and the healthy population in two independent cohorts (IPX0002186001 and IPX0002924001). In the raw data, a mere 10.9% of the proteins exhibiting significant differences were shared between both cohorts. After processing the data with the STAVER algorithm, the proportion of shared differentially expressed proteins (DEPs; proteins with significant differences as determined by the Wilcoxon rank-sum test, p-value < 0.05) increased to 27.3%, highlighting the algorithm's effectiveness in enhancing data quality and revealing more meaningful relationships between the cohorts (Fig. 4D). This suggested that the STAVER algorithm could improve the consistency and comparability of results across different platforms by minimizing technical variation and highlighting biological differences. We then compared the DEPs between different disease stages in three cohorts (IPX0002186001, PXD025752, and PXD018874). In the STAVER-processed data, 6.2% (59) of the proteins with significant differences were consistent in all three cohorts. In comparison, only 0.7% (20) of such proteins were shared among the three cohorts in the raw data (Fig. 4E). This implied that the STAVER algorithm could increase the convergence of differential proteins by eliminating unwanted noise from non-biological variation across different platforms.
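The shared-DEP proportions discussed above amount to a set overlap across cohorts. The text does not state the denominator, so the sketch below assumes it is the union of DEPs over all cohorts; `shared_fraction` is a hypothetical name and the protein lists are made up.

```python
def shared_fraction(*dep_sets):
    """Share of differential proteins that appear in every cohort.

    Assumption: the denominator is the union of DEPs across all cohorts.
    """
    sets = [set(s) for s in dep_sets]
    union = set().union(*sets)
    if not union:
        return 0.0
    shared = set.intersection(*sets)
    return len(shared) / len(union)

# Made-up DEP lists for two cohorts (illustrative only, not study results).
cohort_a = {"ALB", "SELENOP", "PF4", "CRP"}
cohort_b = {"ALB", "PF4", "SAA1"}
overlap = shared_fraction(cohort_a, cohort_b)
print(overlap)  # 2 shared proteins out of 5 distinct ones -> 0.4
```

The same function accepts three or more cohorts, which corresponds to the three-cohort comparison in the text.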

Identifying plasma protein biomarkers for patients with COVID-19 is crucial for diagnosing, evaluating prognosis, and treating this disease²³,²⁴. Previous studies have clinically validated four plasma proteins (albumin [ALB], selenoprotein P [SELENOP], antithrombin III [SERPINC1] and platelet factor 4 [PF4]) as biomarkers for COVID-19 severity and mortality (Supplementary Table 3). These proteins are generally downregulated in COVID-19 patients compared to healthy controls, reflecting the systemic inflammation and coagulation dysfunction caused by the viral infection. However, the discovery of COVID-19 plasma protein biomarkers is hampered by the heterogeneity and inconsistency of data acquisition platforms across different cohorts and laboratories. For instance, the raw quantitative proteomic data from two independent cohorts (IPX0002924001 and IPX0002186001) that used different DIA mass spectrometry platforms showed contradictory results for these four proteins, with some of them upregulated in one cohort but not in the other (Fig. 4F; Supplementary Table 3). This inconsistency undermines the biological validity and reproducibility of the findings and limits their clinical utility. We then applied the STAVER algorithm to standardize protein quantification across the platforms and to filter out unreliable or spurious signals. After applying the STAVER algorithm to the raw data from the two cohorts, all four proteins showed a consistent trend of downregulation and significant biological differences in both cohorts, in agreement with the clinical criteria (Fig. 4F).

Similarly, two other COVID-19 plasma protein markers (transforming growth factor beta-induced protein [TGFBI]²⁵ and amine oxidase copper-containing 3 [AOC3]²³), which are usually upregulated in patients due to their roles in tissue remodeling and vascular inflammation, also showed inconsistent results in the raw quantitative proteomic data from the two cohorts (Fig. 4F; Supplementary Table 3). The STAVER algorithm corrected this inconsistency and revealed that both proteins were significantly upregulated in both cohorts. These results demonstrate the power of the STAVER algorithm to unify DIA data from different platforms and improve the biological comparability and reproducibility of COVID-19 biomarker discovery.

In conclusion, the STAVER algorithm standardized and unified DIA proteomic data from different platforms, improving the biological signal by removing unwanted noise from non-biological variation. The algorithm also increased the reproducibility and reliability of protein quantification, thereby enhancing the accuracy of differential protein analysis. Furthermore, applying STAVER to four published COVID-19 plasma DIA proteomics datasets enabled the discovery of consistent and reliable plasma protein biomarkers across different cohorts and laboratories, demonstrating its potential clinical utility. STAVER therefore represents a powerful tool for improving the comparability and reproducibility of DIA proteomic data, facilitating the understanding of complex biological systems and processes, and accelerating the discovery of disease biomarkers.

Discussion

In this study, we developed STAVER, a standardized dataset-based algorithm for efficient variation reduction in large-scale DIA MS data. STAVER leverages a reference dataset to standardize mass spectrometry signals across different runs and batches, thereby removing noise and improving protein quantification accuracy. We demonstrated the excellent performance of the STAVER algorithm for noise reduction and multi-spectral library search in several large-scale DIA datasets covering different biological systems and platforms. STAVER also enabled us to uncover subtle biological differences that were otherwise masked by unwanted variation, showing its potential for discovering new biology.

One of the main challenges in DIA-based proteomics is to achieve high quantitative precision and reproducibility across large-scale datasets²⁶. Previous studies have shown that various factors can introduce unwanted variation in DIA MS data, such as instrument drift, sample preparation errors, matrix effects and batch effects^27,28. These sources of variation can affect the quality of protein identification and quantification, especially when using multi-spectral library search strategies that combine spectral libraries from different sources. To address this issue, several methods have been proposed to normalize or correct DIA MS data based on internal standards^29,30, external standards³¹, quality control samples^18,32, or reference peptides³³. However, these methods have some limitations, such as requiring additional experimental steps or materials, being sensitive to outliers or missing values, or relying on assumptions about the distribution of peptide intensities.

STAVER has several advantages for processing DIA MS data compared to existing methods. First, its user-friendly modular design offers flexible compatibility and efficiency, requiring only a few adjustable parameters based on empirical criteria. This approach eliminates the need for extra experimental steps or materials, addresses missing values without imputation, scales effectively to large-scale datasets with hundreds or thousands of samples, and is compatible with any existing DIA MS data analysis pipeline. Second, STAVER is robust and versatile: it employs a robust weighted penalty regression model, which is less vulnerable to outliers than alternative models. The approach does not assume a specific distribution for peptide intensities and accounts for various sources of variation, such as technical and biological factors. Furthermore, it is compatible with any spectral library, whether empirical or synthetic, and applicable to any DIA MS data, including label-free and labeled approaches. Third, STAVER yields biologically meaningful results by preserving biological differences between samples while minimizing unwanted variation. It enhances protein identification and quantification by integrating information from multiple spectral libraries and improves differential expression analysis by augmenting statistical power.

STAVER has some limitations that could be addressed in future research. One limitation is its dependence on the availability and quality of a standardized dataset that accurately represents the samples under investigation. Although public repositories such as PeptideAtlas can provide suitable standardized datasets for various applications^34,35, certain cases might require alternative approaches. For example, synthetic peptides or machine learning techniques could be employed to generate a standardized dataset that matches the samples being studied. Another limitation is the assumption of linear and homogeneous variation between samples across all peptides. We found this assumption to be generally valid, but some situations may involve non-linear or heterogeneous variation; for instance, different peptides might exhibit varying ionization efficiencies or fragmentation patterns. In these situations, alternative models that account for non-linear or heterogeneous variation, such as non-linear regression models or local scaling factors, could be employed. A third limitation is the lack of explicit accounting for biological variation between samples. While we demonstrated that STAVER preserves biological differences while reducing unwanted variation, biological and technical variation may be confounded in certain cases. For example, different batches could correspond to different biological groups. In these cases, additional steps could separate biological from technical variation, such as using experimental designs that balance batches and groups, or employing statistical methods that adjust for batch effects.

STAVER represents a novel and effective approach for removing unwanted variation in large-scale DIA MS data. By using a standardized dataset as a reference for signal calibration, STAVER enhanced protein identification and quantification accuracy, especially when using multi-spectral library search strategies. Moreover, STAVER enhanced differential expression analysis by increasing statistical power and reducing false positives. With its wide applicability and utility for DIA proteomics, STAVER is compatible with any spectral library, DIA MS data type, and existing pipeline for DIA MS data analysis. STAVER provides a user-friendly and efficient tool to improve the quality and reliability of protein identification and quantification. We believe that STAVER will facilitate the adoption of multi-spectral library search as a powerful tool for comprehensive proteome profiling and discovery in biotechnology research and practice.

STAVER addresses key challenges in large-scale DIA MS data analysis by leveraging a standardized dataset to diminish unwanted variation and improve protein identification and quantification accuracy. Its robust and flexible algorithm, compatibility with diverse spectral libraries and DIA MS data types, and ability to enhance differential expression analysis make it a valuable tool for the proteomics community. Future work could focus on addressing its limitations by exploring alternative methods for generating standardized datasets, accounting for non-linear or heterogeneous variation, and incorporating explicit models for biological variation. By continuing to refine and expand upon the capabilities of STAVER, we can further propel the field of proteomics and uncover novel biological insights.

Materials and Methods

Sample preparation

Cell line sample preparation

The HEK293T cell line (National Infrastructure Cell Line Resource) was cultured in Dulbecco’s modified Eagle’s medium (DMEM; Gibco) with 10% fetal bovine serum (FBS; Gibco), 20 mM glutamine and 1% penicillin–streptomycin (Invitrogen). Cells were collected by centrifugation, washed with phosphate-buffered saline, flash-frozen in liquid nitrogen and stored at −80°C.

Tissue sample preparation

To generate the 327 protein spectral libraries, FFPE specimens were prepared and provided by Zhongshan Hospital. A 4 µm slide from each FFPE block was used for H&E staining. For proteomic sample preparation, 10 µm slides were deparaffinized with xylene and washed with gradient ethanol. The specimens were selected according to H&E staining status and scraped. As a result, a total of 151 cancer tissue samples were collected, including glioma³⁶, esophageal carcinoma³⁷, urothelial carcinoma³⁸, clear cell renal cell carcinoma^39,40 and others; some of the clinically relevant multi-omics findings from these samples have been published (Supplementary Fig. 1B; Supplementary Table 1). The Research Ethics Committee of Zhongshan Hospital, Fudan University approved this study (B2019-200R), and written informed consent was obtained from all patients involved. All materials were aliquoted and stored at −80°C. Each sample was assigned a new research ID and the patient pathology reports were de-identified.

Plasma and body fluid sample preparation

Before enrollment, all volunteers received a comprehensive verbal explanation of the study and provided written informed consent. The study was approved by the Institutional Research Ethics Committee of Zhongshan Hospital (B2019-200R). To establish a complex proteome database of human body fluids, we collected 9 types of body fluids (ascites, bile, cerebrospinal fluid, hydrothorax, temporomandibular joint (TMJ) fluid, plasma, idiopathic pulmonary fibrosis (IPF) blood, serum, and urine) from 176 patients diagnosed with 19 common diseases. The characteristics of these 176 patients are detailed in Supplementary Table 1. All samples underwent centrifugation at 12,000 x g at 4°C for 10 minutes to eliminate cells and debris. The resulting supernatants were collected, and protein concentrations were determined using the Bradford assay (TaKaRa, Beijing, PR China). Samples were then stored at -80°C until subsequent analysis.

Proteomic analysis

Plasma and body fluid protein extraction and trypsin digestion

In this procedure, 2 µL of plasma or body fluid sample was mixed with 100 µL of 50 mM ABC buffer, and proteins were inactivated at 95°C for 5 min. After cooling to room temperature, the samples were digested with trypsin at an enzyme-to-protein mass ratio of 1:50 for 16 hours in a 37°C incubator. Then, 5 µL of aqueous ammonia was added to each tube and the tubes were vortexed to quench the digestion reaction. The supernatant was dried at 60°C using a vacuum drier (SpeedVac, Eppendorf). After drying, the peptides were dissolved in 100 µL of 0.1% formic acid and centrifuged for 3 min (12,000 × g). The supernatant was transferred to a new tube and then desalted. Before desalting, the columns, each packed with two slices of 3M C18 disk, were activated with the following sequence of liquids: 100 µL of 100% acetonitrile twice, 100 µL each of 50% and 80% acetonitrile once in turn, and then 100 µL of 50% acetonitrile once. The columns were equilibrated with two 100 µL washes of 0.1% formic acid, and the sample supernatant was loaded twice onto the columns, which were then washed twice with 100 µL of 0.1% formic acid for decontamination. Lastly, 100 µL of elution buffer (0.1% formic acid in 50% acetonitrile) was added to the column twice for elution, and only the eluate was collected for mass spectrometry analysis. The collected eluate was dried at 60°C in a vacuum drier.

Tissue protein extraction and trypsin digestion

For MS analysis, 10 µm slides from FFPE blocks were macro-dissected, deparaffinized with xylene, and washed with ethanol. Next, 100 µL of TCEP buffer (2% sodium deoxycholate, 40 mM 2-chloroacetamide, 100 mM tris(2-carboxyethyl) phosphine hydrochloride, 10 mM tris(2-carboxyethyl) phosphine hydrochloride, and 1 mM phenylmethylsulfonyl fluoride in MS-grade water, pH 8.8) was added to 1.5 mL tubes containing the prepared samples. These were heated in a metal bath at 99°C for 30 min and cooled to room temperature. Samples were then digested with trypsin at an enzyme-to-protein mass ratio of 1:50 for 16 hours at 37°C. Peptides were extracted and dried using a SpeedVac (Eppendorf).

Tissue peptide desalination

Then, 26 µL of 10% formic acid (FA) was added to each tube. After vortexing for 3 min and centrifuging at 12,000 × g for 10 min, supernatants were collected and mixed with 350 µL of buffer (0.1% FA in 50% acetonitrile [ACN]). Following vortexing for 3 min and centrifuging at 12,000 × g for 5 min, supernatants were transferred to new tubes and dried in a vacuum drier at 60°C. Next, peptides were dissolved in 100 µL of 0.1% FA, vortexed for 3 min, and centrifuged at 12,000 × g for 5 min. Supernatants were collected and desalted using 3M C8-packed columns, which were activated with two 100 µL washes of 100% ACN, one wash of 50% ACN, and two washes of 0% ACN. The columns were equilibrated with two 100 µL washes of 0.1% FA, loaded with the sample supernatant twice, and decontaminated with two 100 µL washes of 0.1% FA. Peptides were eluted twice with 100 µL of elution buffer (0.1% FA in 50% ACN), and the eluates were collected for MS analysis. The samples were dried in a 60°C vacuum drier and stored at −80°C until LC-MS/MS analysis.

LC-MS/MS Data-independent Acquisition

To evaluate the performance of the STAVER algorithm, the HEK293T cell samples were measured using LC-MS instrumentation consisting of an EASY-nLC 1200 ultra-high-pressure system (Thermo Fisher Scientific) coupled via a nano-electrospray ion source (Thermo Fisher Scientific) to a Q Exactive HF-X (Thermo Fisher Scientific). The peptides were dissolved in 12 µL of loading buffer (0.1% formic acid in water), and 5 µL was loaded onto a 75 µm i.d. column (C18, 1.9 µm, 120 Å, Dr. Maisch GmbH) at a maximum pressure of 280 bar with 14 µL of solvent A (0.1% formic acid in water). Peptides were separated on the column with a linear gradient of 15–30% mobile phase B (ACN with 0.1% formic acid) at 300 nL/min for 10 min. MS was operated in data-independent acquisition mode. The DIA method consisted of an MS1 scan from 300–1400 m/z at 60k resolution (AGC target 4e5 or 50 ms). Then 30 DIA segments were acquired at 15k resolution with an AGC target of 5e4 or a maximal injection time of 22 ms. The setting "inject ions for all available parallelizable time" was enabled. HCD fragmentation was set to a normalized collision energy of 27%. The spectra were recorded in profile mode. The default charge state for MS2 was set to 3.

LC-MS/MS Data-Dependent Acquisition

To generate the protein spectral libraries, the peptides of 327 distinct body fluid and cancer tissue samples were analyzed on Orbitrap Fusion Lumos and Q Exactive HF-X mass spectrometers (Thermo Fisher Scientific) equipped with an Easy nLC-1200 (Thermo Fisher Scientific) and a Nanoflex source (Thermo Fisher Scientific). The peptides were re-dissolved in 12 µL of loading buffer (5% methanol and 0.2% FA). Then, 6 µL or 9 µL of sample was loaded on a 30 cm long, homemade C18 nano-capillary analytical column (75 µm i.d.) packed with C18 resin (particle size: 3 µm, pore size: 100 Å; Dikma Technologies) for proteome analysis. The peptides were separated on the analytical column with a 150 min gradient (buffer A: 0.1% FA in water; buffer B: 0.1% FA in 80% ACN) at 600 nL/min (0 min, 4% B; 0–10 min, 4–15% B; 10–125 min, 15–30% B; 125–140 min, 30–50% B; 140–141 min, 50–100% B; 141–150 min, 100% B). MS was operated in data-dependent acquisition mode. For the MS1 full scan, ions with m/z ranging from 300 to 1400 were acquired by the Orbitrap mass analyzer at a resolution of 120,000. The automatic gain control (AGC) target value was set to 3E+06. The maximal ion injection time was 80 ms. MS2 spectra were acquired in top-speed mode. Precursor ions were selected and fragmented by higher-energy collisional dissociation with a normalized collision energy of 27%. Fragment ions were analyzed using an ion trap mass analyzer with an AGC target value of 5E+04 and a maximal ion injection time of 20 ms. Peptides that triggered MS/MS scans were dynamically excluded from further MS/MS scans for 12 s. Each single-run measurement lasted 150 min.
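As an illustration, the 150 min gradient described above can be written as a list of breakpoints and queried for the buffer B percentage at any time. This is a minimal sketch that assumes each step is a linear ramp between the stated endpoints; the names `GRADIENT` and `percent_b` are hypothetical and not part of any instrument software.

```python
# Gradient breakpoints (time in min, % buffer B) transcribed from the text.
GRADIENT = [(0, 4), (10, 15), (125, 30), (140, 50), (141, 100), (150, 100)]


def percent_b(t_min, gradient=GRADIENT):
    """Return % buffer B at time t_min.

    Assumption: %B changes linearly between consecutive breakpoints.
    """
    if not gradient[0][0] <= t_min <= gradient[-1][0]:
        raise ValueError("time is outside the gradient")
    for (t0, b0), (t1, b1) in zip(gradient, gradient[1:]):
        if t0 <= t_min <= t1:
            return b0 + (b1 - b0) * (t_min - t0) / (t1 - t0)


print(percent_b(5))     # midway through the 4-15% B ramp
print(percent_b(67.5))  # midway through the 15-30% B ramp
```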

Peptide identification and Label-free protein quantification

All newly generated and collected proteomic data were processed using Firmiana⁴¹. The DDA data were searched against the UniProt human protein database (updated on 2019.12.17, 20406 entries) using FragPipe (v12.1) with MSFragger (2.2)¹⁴. The mass tolerances were set at 20 ppm for precursor ions and 50 mmu for product ions. Up to two missed cleavages were allowed. The search engine set cysteine carbamidomethylation as a fixed modification and N-acetylation and oxidation of methionine as variable modifications. Precursor ion charge states were limited to +2, +3, and +4. The data were also searched against a decoy database to accept protein identifications at a false discovery rate (FDR) of 1%. The results of the DDA data were combined into spectral libraries using SpectraST software^42,43. A total of 327 libraries were used as reference spectral libraries.

DIA data were analyzed using DIA-NN (v1.8.1)⁸ with the following settings (Precursor FDR: 5%, Log lev: 1, Mass accuracy: 20 ppm, MS1 accuracy: 10 ppm, Scan window: 30, Implicit protein group: genes, Quantification strategy: robust LC (high accuracy)). Quantification of identified peptides was calculated as the average of chromatographic fragment ion peak areas across all reference spectral libraries. Label-free protein quantification used an intensity-based approach incorporating delayed normalization and maximal peptide ratio extraction (MaxLFQ)⁴⁴. Peak area values were computed as components of their respective proteins.
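The averaging of peak areas across reference spectral libraries can be sketched as follows. This is only an illustration of the arithmetic, not the STAVER implementation: the function name, the dictionary layout, and the choice to skip libraries in which the peptide was not identified (represented as `None`) are all assumptions.

```python
from statistics import mean


def average_across_libraries(peak_areas):
    """Average one peptide's peak area over the libraries that identified it.

    peak_areas: dict mapping library name -> peak area, or None when the
    peptide was not identified with that library (assumed to be skipped).
    Returns None if no library identified the peptide.
    """
    observed = [area for area in peak_areas.values() if area is not None]
    return mean(observed) if observed else None


# Hypothetical peak areas for a single peptide searched against three libraries.
peptide_areas = {"library_1": 100.0, "library_2": 300.0, "library_3": None}
print(average_across_libraries(peptide_areas))
```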

Quality control of the MS data

Quality control of the MS platform

For the quality control of MS performance, the HEK293T cell (National Infrastructure Cell Line Resource) lysate was measured every three days as the quality control standard. A pairwise Spearman correlation coefficient was calculated for all quality control runs in the statistical analysis environment R (version 4.1.0), and the results were shown in Supplementary Fig. 1C. The average correlation coefficients of proteome and phosphorylated proteome standards were 0.92 (IQR = 0.90–0.93), exhibiting excellent reproducibility for repeat experiments with the same samples. The above results demonstrated the consistent stability of the MS platform.

Quality evaluation of proteomic data

Density plots of peptide counts and protein identifications showed a clear bimodal distribution, indicating that some samples had insufficient peptide or protein yields and should be excluded from further analysis. Therefore, the quality of the generated data was assessed by examining the distribution status of the identified proteins for all samples in the R environment (version 4.1.0). In our study, all technical repeat injections successfully passed the quality control and demonstrated excellent consistency in terms of proteome quantification (Supplementary Fig. S1D, S1E), exhibiting a typical unimodal (Gaussian or normal) distribution (statistical dip test).

Proteome data analysis

Spearman correlation analysis

In this study, the Spearman correlation analysis was utilized to assess the reproducibility between samples and the performance of mass spectrometry instrumentation runs. This non-parametric statistical method quantifies the strength and direction of association between two variables. The evaluation encompassed three distinct datasets: 20 technical repeat injections, 8 quality control (QC) samples, and a publicly accessible proteome QC dataset consisting of 39 samples. The outcomes demonstrated median Spearman's rank correlation coefficients of 0.93, 0.92, and 0.94 for the technical repeat injections, QC samples, and public proteome QC dataset, respectively. These elevated correlation coefficients signify a robust level of reproducibility and consistency within the mass spectrometry data, which is crucial for ensuring precise and dependable proteomic analyses (Fig. 2F; Fig. 3A; Supplementary Fig. S2A).
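The median pairwise Spearman coefficient reported for each dataset can be computed along the following lines. This is a dependency-free sketch rather than the authors' R code; it assumes every run lists intensities for the same proteins in the same order with no missing values, and the run names and intensities are hypothetical.

```python
from itertools import combinations
from statistics import median


def average_ranks(values):
    """Rank values from 1 to n, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho, computed as the Pearson correlation of the ranks.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


def median_pairwise_spearman(runs):
    """runs: dict of run name -> list of protein intensities."""
    rhos = [spearman(runs[a], runs[b]) for a, b in combinations(runs, 2)]
    return median(rhos)


# Hypothetical intensities for three QC runs over four proteins.
runs = {
    "qc1": [10.0, 20.0, 30.0, 40.0],
    "qc2": [11.0, 19.0, 33.0, 41.0],
    "qc3": [12.0, 25.0, 24.0, 50.0],
}
print(median_pairwise_spearman(runs))
```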

Differentially expressed proteins (DEPs) analysis

Differential expression analysis of proteins (DEPs) in experimental samples featuring distinct phenotypes was conducted using the Wilcoxon rank-sum test. This analysis encompassed comparisons between COVID-19 patient cohorts and healthy control groups, as well as among COVID-19 patients exhibiting varying degrees of disease severity. Prior to the differential expression analysis, proteomic data containing over 30% missing data were excluded. Following this, the protein expression matrix was utilized to identify differentially expressed proteins within the comparison groups, employing the Contrasts functions incorporated in the limma²² (v.3.48.3) R/Bioconductor package. P-values were adjusted using the Benjamini-Hochberg (BH) method, and proteins exhibiting an adjusted p-value < 0.05 and a fold change > 1.5 were deemed significantly differentially expressed.
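The filtering and selection steps above can be sketched as follows. The sketch takes per-protein p-values and fold changes as given (from whichever test produced them) and shows only the missing-value filter, the Benjamini-Hochberg adjustment, and the thresholds. The function names are hypothetical, and applying the fold-change cutoff symmetrically (also accepting fold changes below 1/1.5) is an assumption, since the text states only "> 1.5".

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    # Walk from the largest p-value down, enforcing monotonicity.
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for position, idx in enumerate(order):
        rank = m - position  # 1-based rank in ascending order
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def keep_protein(values, max_missing=0.30):
    """Keep a protein unless more than 30% of its values are missing (None)."""
    missing = sum(v is None for v in values)
    return missing / len(values) <= max_missing


def select_deps(results, alpha=0.05, min_fc=1.5):
    """results: list of (protein, p_value, fold_change) tuples.

    Assumption: the fold-change cutoff is symmetric (fc > 1.5 or fc < 1/1.5).
    """
    adjusted = bh_adjust([p for _, p, _ in results])
    return [
        name
        for (name, _, fc), q in zip(results, adjusted)
        if q < alpha and (fc > min_fc or fc < 1 / min_fc)
    ]


# Hypothetical test results (protein, p-value, fold change).
results = [("A", 0.01, 2.0), ("B", 0.04, 1.2), ("C", 0.03, 0.5), ("D", 0.005, 1.6)]
print(select_deps(results))
```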

Statistical methods

All analyses were performed in R (version 4.1.0) and Python (version 3.8.6). Standard statistical tests, including the Wilcoxon rank-sum test and Student's t-test, were used to test for statistically significant differences between subgroups of continuous variables. The Mann-Whitney U test was used to evaluate significant differences between Grantham distances for subgroups of continuous variables, and Spearman's rank correlation coefficient was calculated to examine relationships between proteome data. To account for multiple testing, P-values were adjusted using the Benjamini-Hochberg FDR correction method. All statistical tests were two-sided, and P-values < 0.05 were considered to indicate statistical significance.

Data availability

The newly generated mass spectrometry proteomics data, comprising 90 standard samples from HEK293T cells, 20 technical repeat injections, 8 quality control (QC) runs and 327 data-dependent acquisition (DDA) raw files for creating 327 hybrid libraries, have been deposited to the ProteomeXchange Consortium via the iProX repository. The dataset identifier is IPX0006296000, with subproject IDs IPX0006296001 and IPX0006296002 (URL: https://www.iprox.cn/page/PSV023.html;?url=16824283927944G1g; PASSWORD: 2ivG). To benchmark the STAVER software, we also used previously published raw data, which can be accessed from the iProX data repository (accession numbers IPX0002186001 and PXD024707) and the PRIDE data repository (identifiers PXD025752 and PXD018874). We have generated summarized experiment objects containing precursor and protein identification and quantification information. All supplementary materials, including supplementary tables, processed data from this manuscript, compiled public database resources, as well as additional methods and references, have been deposited in Zenodo (for further details, visit https://doi.org/10.5281/zenodo.7876049/).

Code availability

The STAVER framework has been implemented in the “dia-staver” Python package, which is available at https://github.com/Ran485/STAVER. To ensure reproducibility, the scripts used for all benchmarks, case studies, data analysis and generation of both main and supplementary figures, which illustrate the steps involved in processing the DIA proteomics data, are also available in the above GitHub repository.

Author Contributions

Chen Ding, Peng Ran, and Yunzhi Wang conceptualized and designed the overall approach. Chen Ding and Peng Ran developed and implemented the STAVER algorithms. Kai Li, Jiacheng Lv, Shiman He, and Lingli Zhu conducted the experiments, and Peng Ran collected the public data, while Peng Ran and Shiman He carried out the data analysis. Peng Ran and Yunzhi Wang wrote the initial manuscript draft, which was reviewed and approved by all authors. All authors actively engaged in discussions regarding the results and provided valuable input on the manuscript.

Acknowledgments

This work is supported by the National Key R&D Program of China: 2022YFA1303200, 2022YFA1303201; National Key R&D Program of China: 2017YFA0505102, 2016YFA0502500, 2018YFA0507501, 2017YFC0908404, 2020YFE0201600, 2018YFE0201603, 2017YFA0505101; National Natural Science Foundation of China: 31770886, 31972933, 31700682, 32201212; Shanghai Municipal Science and Technology Major Project: 2017SHZDZX01; Shuguang Program of Shanghai Education Development Foundation and Shanghai Municipal Education Commission: 19SG02; Program of Shanghai Academic/Technology Research Leader: 22XD1420100; Major Project of Special Development Funds of Zhangjiang National Independent Innovation Demonstration Zone: ZJ2019-ZD-004; China Postdoctoral Science Foundation: 2019M651268; National Postdoctoral Program for Innovative Talents; The Fudan original research personalized support project.

Declaration of interests

The authors declare no competing financial interests.

References

Doerr, A. DIA mass spectrometry. Nat Methods 12, 35 (2015).
Meier, F. et al. Parallel Accumulation–Serial Fragmentation (PASEF): Multiplying Sequencing Speed and Sensitivity by Synchronized Scans in a Trapped Ion Mobility Device. J. Proteome Res. 14, 5378–5387 (2015).
Meier, F. et al. diaPASEF: parallel accumulation–serial fragmentation combined with data-independent acquisition. Nat Methods 17, 1229–1236 (2020).
Krasny, L. & H. Huang, P. Data-independent acquisition mass spectrometry (DIA-MS) for proteomic applications in oncology. Molecular Omics 17, 29–42 (2021).
A data-independent acquisition-based global phosphoproteomics system enables deep profiling. Nature Communications. https://www.nature.com/articles/s41467-021-22759-z.
Zhao, J. et al. Data-independent acquisition boosts quantitative metaproteomics for deep characterization of gut microbiota. npj Biofilms Microbiomes 9, 1–14 (2023).
Ludwig, C. et al. Data-independent acquisition-based SWATH-MS for quantitative proteomics: a tutorial. Molecular Systems Biology 14, e8126 (2018).
Demichev, V., Messner, C. B., Vernardis, S. I., Lilley, K. S. & Ralser, M. DIA-NN: neural networks and interference correction enable deep proteome coverage in high throughput. Nat Methods 17, 41–44 (2020).
SWATH-MS-Based Proteomics: Strategies and Applications in Plants. https://www.sciencedirect.com/science/article/abs/pii/S0167779920302390.
Demichev, V. et al. dia-PASEF data analysis using FragPipe and DIA-NN for deep proteomics of low sample amounts. Nat Commun 13, 3944 (2022).
Tsou, C.-C. et al. DIA-Umpire: comprehensive computational framework for data-independent acquisition proteomics. Nat Methods 12, 258–264 (2015).
Röst, H. L. et al. OpenSWATH enables automated, targeted analysis of data-independent acquisition MS data. Nat Biotechnol 32, 219–223 (2014).
Chen, Y. et al. Blood molecular markers associated with COVID‐19 immunopathology and multi‐organ damage. EMBO J 39, (2020).
Kong, A. T., Leprevost, F. V., Avtonomov, D. M., Mellacheruvu, D. & Nesvizhskii, A. I. MSFragger: ultrafast and comprehensive peptide identification in mass spectrometry–based proteomics. Nat Methods 14, 513–520 (2017).
Jenkins, R. et al. Recommendations for Validation of LC-MS/MS Bioanalytical Methods for Protein Biotherapeutics. AAPS J 17, 1–16 (2015).
Santos, A. et al. A knowledge graph to interpret clinical proteomics data. Nat Biotechnol 40, 692–702 (2022).
Lou, R. et al. Benchmarking commonly used software suites and analysis workflows for DIA proteomics and phosphoproteomics. Nat Commun 14, 94 (2023).
Messner, C. B. et al. Ultra-High-Throughput Clinical Proteomics Reveals Classifiers of COVID-19 Infection. Cell Systems 11, 11-24.e4 (2020).
Li, H. et al. Plasma proteomic and metabolomic characterization of COVID-19 survivors 6 months after discharge. Cell Death Dis 13, 1–12 (2022).
Völlmy, F. et al. A serum proteome signature to predict mortality in severe COVID-19 patients. Life Science Alliance 4, (2021).
Demichev, V. et al. A time-resolved proteomic and prognostic map of COVID-19. Cell Systems 12, 780-794.e7 (2021).
Ritchie, M. E. et al. Limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research 43, e47 (2015).
Ponti, G., Maccaferri, M., Ruini, C., Tomasi, A. & Ozben, T. Biomarkers associated with COVID-19 disease progression. Critical Reviews in Clinical Laboratory Sciences 57, 389–399 (2020).
Zhang, L. & Guo, H. Biomarkers of COVID-19 and technologies to combat SARS-CoV-2. Advances in Biomarker Sciences and Technology 2, 1–23 (2020).
Malik, P. et al. Biomarkers and outcomes of COVID-19 hospitalisations: systematic review and meta-analysis. BMJ Evid Based Med 26, 107–108 (2021).
Strategies to enable large-scale proteomics for reproducible research. Nature Communications. https://www.nature.com/articles/s41467-020-17641-3.
Wehrens, R. et al. Improved batch correction in untargeted MS-based metabolomics. Metabolomics 12, 88 (2016).
Čuklina, J. et al. Diagnostics and correction of batch effects in large‐scale proteomic studies: a tutorial. Mol Syst Biol 17, e10240 (2021).
Kitata, R. B., Yang, J.-C. & Chen, Y.-J. Advances in data-independent acquisition mass spectrometry towards comprehensive digital proteome landscape. Mass Spectrometry Reviews n/a, e21781.
Ong, S.-E. & Mann, M. Mass spectrometry–based proteomics turns quantitative. Nat Chem Biol 1, 252–262 (2005).
Rozanova, S. et al. Quantitative Mass Spectrometry-Based Proteomics: An Overview. in Quantitative Methods in Proteomics (eds. Marcus, K., Eisenacher, M. & Sitek, B.) 85–116 (Springer US, 2021). doi:10.1007/978-1-0716-1024-4_8.
Midha, M. K. et al. DIALib-QC an assessment tool for spectral libraries in data-independent acquisition proteomics. Nat Commun 11, 5251 (2020).
Ludwig, C. et al. Data-independent acquisition-based SWATH-MS for quantitative proteomics: a tutorial. Molecular Systems Biology 14, e8126 (2018).
Desiere, F. The PeptideAtlas project. Nucleic Acids Research 34, D655–D658 (2006).
Deutsch, E. W., Lam, H. & Aebersold, R. PeptideAtlas: a resource for target selection for emerging targeted proteomics workflows. EMBO Rep 9, 429–434 (2008).
Wang, Y. et al. Proteogenomics of diffuse gliomas reveal molecular subtypes associated with specific therapeutic targets and immune-evasion mechanisms. Nat Commun 14, 505 (2023).
Li, L. et al. Integrative proteogenomic characterization of early esophageal cancer. Nat Commun 14, 1666 (2023).
Xu, N. et al. Integrated proteogenomic characterization of urothelial carcinoma of the bladder. J Hematol Oncol 15, 76 (2022).
Qu, Y. et al. A proteogenomic analysis of clear cell renal cell carcinoma in a Chinese population. Nat Commun 13, 2052 (2022).
Qu, Y. et al. Proteogenomic characterization of MiT family translocation renal cell carcinoma. Nat Commun 13, 7494 (2022).
Feng, J. et al. Firmiana: towards a one-stop proteomic cloud platform for data processing and analysis. Nat Biotechnol 35, 409–412 (2017).
Lam, H. et al. Building Consensus Spectral Libraries for Peptide Identification in Proteomics. Nat Methods 5, 873–875 (2008).
Lam, H. Building and Searching Tandem Mass Spectral Libraries for Peptide Identification*. Molecular & Cellular Proteomics 10, R111.008565 (2011).
Cox, J. et al. Accurate Proteome-wide Label-free Quantification by Delayed Normalization and Maximal Peptide Ratio Extraction, Termed MaxLFQ. Mol Cell Proteomics 13, 2513–2526 (2014).

