Significance of Benchmark Studies for the Field of Proteomics
Benchmark studies have become an invaluable tool to objectively assess the advantages and disadvantages of the choices made over the course of a proteomics study, including sample preparation, data acquisition, MS data analysis, and statistical processing.
However, benchmark studies often suffer from small sample sizes and unrealistically low background variance 5,12,13.
Biomarker discovery studies often include hundreds of patients with heterogeneous proteomes and can contain high within- and between-person variance. In this setting, data characteristics and the resulting performance of data analysis workflows cannot be estimated using standard benchmark datasets.
Here, we set out to empirically investigate the performance of complete multi-step data analysis workflows for DIA-type proteomics, composed of library generation methods, DIA analysis software suites, data preprocessing, and statistical analyses.
Overview of LC-MS/MS Measurements and Initial Results
We obtained 92 formalin-fixed paraffin-embedded (FFPE) tumor-free lymph node tissue specimens derived from patients with primary acinar prostate cancer. Following protein digestion and sample clean-up, we split the samples into four groups and added E. coli peptides at human:E. coli peptide ratios of 6:1, 12:1, and 25:1, or added no E. coli peptides at all. These four groups are referred to as ‘spike-in conditions’ and comprise n=23 samples each (Figure 1). The resulting set of DIA LC-MS/MS measurements (92 LC-MS/MS files) consists of 12.4 million MS2 spectra.
To generate experiment-specific spectral libraries, we performed gas-phase fractionation (GPF) on a mastermix representing an average human:E. coli peptide ratio of 15:1. Using DIA-NN to refine a combined human and E. coli in silico predicted DIA-NN spectral library, we generated a spectral library containing 84,016 precursor entries mapping to 10,459 proteins. Using an in silico predicted PROSIT spectral library refined by EncyclopeDIA, we generated a spectral library containing 45,445 precursors mapping to 8,472 proteins 14.
We also pre-fractionated a mastermix to obtain samples for in-depth DDA library generation 15. Using FragPipe (with MSFragger) to generate a spectral library from these DDA files, we obtained a spectral library containing 81,409 precursors mapping to 7,781 proteins. We also used MaxQuant to build a DDA-based spectral library containing 51,260 precursors mapping to 7,382 proteins.
Using DIA-NN in combination with the in silico predicted DIA-NN GPF-refined spectral library, on average 48,698 precursors were identified per measurement with an average chromatographic peak width of 8 seconds (full width at half height). No batch effects originating from sample preparation or order of measurement were apparent in this analysis (Supplementary Figure S1).
Assessment of data analysis workflows for DIA-type proteomics
We chose four commonly used DIA analysis software suites: DIA-NN, Skyline, OpenSwath, and Spectronaut. Whenever possible, we combined each generated library with each DIA analysis software solution (the predicted spectral libraries in particular pose a challenge to some software suites when combined with the high number of samples). We also included ‘DirectDIA’, a feature of Spectronaut that does not require any additional experimental evidence for library generation. This resulted in a total of 17 different DIA analysis workflows. For all subsequent analysis steps, protein-level output from the DIA analysis workflows was used. For the sake of simplicity, we focused our statistical analyses on the comparison between the two lowest E. coli spike-in conditions (Figure 1). This also represents the greatest challenge to any DIA analysis software, as quantitation is usually less precise for low-abundance proteins 16,17.
In this study, we used bootstrapping (see below) to investigate the effect of sample size, normalization, sparsity reduction and choice of a statistical test on the overall ability of the data analysis workflow to detect differentially abundant proteins.
In brief, we randomly drew samples from each of the two lowest E. coli spike-in conditions with group sizes of three to 23 samples. On each bootstrap dataset we applied different data analysis workflows composed of multiple options for the preprocessing steps in the form of sparsity reduction and normalization, followed by one of five statistical tests to identify differentially abundant proteins.
Taking into account the aforementioned 17 DIA analysis workflows, we acquired prediction performance information for 1020 different data analysis workflows, each of which was applied to 2100 bootstrap datasets, resulting in over 2 million analyses.
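The combinatorics behind these numbers can be made explicit: 17 DIA analysis workflows x 3 sparsity reduction options x 4 normalization options x 5 statistical tests = 1020 analysis workflows. A minimal Python sketch follows; only limma, SAM, and ROTS are named as tests in this text, so the remaining two test labels are placeholders.

import itertools

dia_workflows = [f"workflow_{i:02d}" for i in range(1, 18)]    # 17 library/software combinations
sparsity      = ["NoSR", "SR66", "SR90"]                       # 3 sparsity reduction options
normalization = ["none", "QN", "TRQN", "median"]               # 4 normalization options
tests         = ["limma", "SAM", "ROTS", "test_4", "test_5"]   # 5 tests; last two are placeholders

combos = list(itertools.product(dia_workflows, sparsity, normalization, tests))
print(len(combos))           # 17 * 3 * 4 * 5 = 1020 analysis workflows
print(len(combos) * 2100)    # applied to 2100 bootstrap datasets: 2,142,000 analyses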
This staggering number illustrates the sheer number of possible combinations of library generation methods, DIA software suites, and downstream data preprocessing and statistical analysis methods that proteomics scientists are confronted with. As every study is different and no truly universally applicable methods are available in proteomics, the experience and choices of the proteomics data analyst determine the reliability and reproducibility of a proteomics study, as impressively demonstrated by Choi et al 13.
Analyses of the LC-MS/MS Data
We first assessed the number of identified and quantified proteins, applying a 1% protein FDR cutoff to all workflows (Figure 2, left panel). As the number of identified proteins depends on the number of proteins physically present in a sample, the samples are grouped by spike-in condition. The DDA spectral libraries consistently led to smaller numbers of identified proteins. As the tissue used in this study was formalin-fixed, chemical modifications can reduce the number of identified peptides and proteins during spectral library generation. In our experience, GPF-refined spectral libraries often lead to higher identification rates in DIA-type proteomics data of FFPE tissue. The total number of identifiable proteins increases with increasing amounts of spiked-in E. coli proteins. Using a GPF-refined in silico predicted PROSIT spectral library in combination with Skyline yielded the highest number of quantified proteins, ranging from a mean of 7,388 proteins for samples without spike-in to a median of 7,480 proteins for samples with a human:E. coli spike-in ratio of 6:1 (w/w).
However, in quantitative proteomics, protein identifications only serve a useful purpose if they are accompanied by robust and reliable quantitation. When summarizing the protein abundances calculated by the different DIA analysis workflows, both the shape of the distribution of log-transformed protein abundances (Figure 2, center panel) and the correlation of log-transformed intensities between DIA analysis workflows depend mostly on the choice of DIA analysis software, and to a lesser extent on the spectral library (Supplementary Figure S2).
Further, we determined the variance of individual E. coli protein intensities per spike-in ratio. Since a single batch of E. coli lysate was used for all spike-ins (thus reducing inter-sample variability between E. coli proteins), a minimal variance indicates a reproducible quantitation algorithm. While the absolute variance of protein intensities is similar across all DIA analysis workflows, the workflows differ in how much this variance varies within each spike-in condition: for DIA-NN and Spectronaut workflows, the variability of variances decreases with higher E. coli spike-in concentrations (except for DIA-NN in combination with the refined PROSIT spectral library), while the opposite behavior is observed for OpenSwath and Skyline (with the exception of the human:E. coli spike-in condition 6:1) (Figure 2, right panel). We therefore specifically investigated the reported quantitations for E. coli proteins (Supplementary Figure S3). While the number of identified and quantified E. coli proteins decreases at lower spike-in concentrations for most workflows, it remains constant for some DIA analysis software suites. This led us to investigate more closely how the different DIA analysis suites report missing proteins.
Missing Values and False-Positive Quantitation
DIA-type proteomics promises to reduce missingness in multi-sample proteomic experiments 11. In the present dataset, 25% of all samples are human-only and devoid of E. coli proteins. This experimental setting supports the illustration not only of missingness, but also of false-positive protein quantitation (here: of E. coli proteins).
As can be appreciated from Figure 3A, the means of the human and E. coli protein intensities correlate negatively with the percentage of missing values per protein for all DIA software suites except Skyline. This negative correlation has also been reported previously in a clinical proteomics study employing a TripleTOF instrument and OpenSwath for data analysis 18. Furthermore, DIA-NN and Spectronaut correctly yield a level of 25% missingness for most E. coli proteins, while Skyline and OpenSwath do not show this clear separation. The nature of the spectral library had only a negligible impact in this regard in most cases. For DIA-NN, however, the number of E. coli proteins reported for samples that contained only human lymph node proteins increased when the EncyclopeDIA-refined in silico predicted PROSIT spectral library was used.
Our observations suggest that OpenSwath and Skyline tend to report background ‘noise’ instead of missing values when no proteins can be confidently identified and quantified. In contrast, DIA-NN and Spectronaut tend, in our analyses, to report missing values when the proteins are physically absent from a sample.
Furthermore, we assessed how the missingness within each sample correlates with the sample mean of protein intensities. While for DIA-NN and Spectronaut this correlation is positive, showing a separation of the spike-in conditions by the sample mean of protein intensities, it is negative for Skyline and OpenSwath, without such a separation (Figure 3B).
We hypothesize that the counter-intuitive positive correlation between protein intensity and missingness, as in the case of DIA-NN and Spectronaut, may be due to sample-dependent detection thresholds 19. In other words, if the intensity of a protein lies below such a threshold, it is not included in the calculation of the sample mean of protein intensities, thereby increasing the weight of higher-intensity proteins. This in turn increases the sample mean of protein intensities.
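This censoring effect can be reproduced in a few lines. The following sketch (illustrative values only) simulates log-intensities for one sample and raises a detection threshold; the mean over the observed proteins increases together with the missingness:

import numpy as np

rng = np.random.default_rng(0)

# Simulated log-intensities for one sample; values below a sample-specific
# detection threshold are reported as missing (left-censoring).
log_intensity = rng.normal(loc=20, scale=2, size=5000)

for threshold in (17, 19, 21):   # rising detection threshold
    observed = log_intensity[log_intensity >= threshold]
    missingness = 1 - observed.size / log_intensity.size
    # The mean over *observed* proteins increases with the threshold,
    # reproducing the positive missingness/sample-mean correlation.
    print(f"threshold={threshold}: missing={missingness:.0%}, "
          f"observed mean={observed.mean():.2f}")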
The implications of these findings are far-reaching and should be taken into account when planning studies: in practically all proteomics experiments, missing proteins are an issue that needs to be addressed, and in some studies, such as knockout experiments or clinical biomarker discovery studies, missing values are even of special interest. To our knowledge, the detection of true missingness and of false-positive quantitation is rarely investigated in benchmarking studies. Our dataset offers a well-suited platform to investigate (and possibly optimize) these aspects for future toolsets.
Post-Processing and Measures of Performance Used in this Study
Although complex in its own right, protein and peptide identification and quantitation from LC-MS/MS data are only the beginning of the complete analysis of a multi-sample, quantitative proteomics experiment. Subsequent steps typically include sparsity reduction, normalization and, ultimately, statistical assessment of differential protein abundance. For each of these steps, different algorithms exist, yielding a variety of possible combinations.
To investigate the performance of the analysis methods in different possible combinations, we jointly assessed commonly used approaches for sparsity reduction, normalization, and statistical testing. For sparsity reduction we applied: a) no sparsity reduction (NoSR), b) requiring >66% valid values per protein (SR66), and c) requiring >90% valid values per protein (SR90). We then applied four options to investigate the effect of normalization: a) no normalization, b) quantile normalization (QN), c) tail-robust quantile normalization (TRQN), and d) median normalization. We then subjected each possible combination to five statistical tests to probe for differentially abundant proteins.
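To make these preprocessing options concrete, the following Python sketch implements sparsity reduction and two of the normalization options on a protein x sample matrix of log-intensities. The thresholds mirror the definitions above; the quantile normalization shown is the textbook variant and assumes a complete matrix, which real implementations relax.

import numpy as np
import pandas as pd

def reduce_sparsity(df: pd.DataFrame, min_frac: float) -> pd.DataFrame:
    """Keep proteins (rows) quantified in more than `min_frac` of samples
    (min_frac=0.66 -> SR66, 0.90 -> SR90; skip the call for NoSR)."""
    return df.loc[df.notna().mean(axis=1) > min_frac]

def median_normalize(df: pd.DataFrame) -> pd.DataFrame:
    """Shift each sample (column) of log-intensities to a common median."""
    medians = df.median(axis=0, skipna=True)
    return df - medians + medians.mean()

def quantile_normalize(df: pd.DataFrame) -> pd.DataFrame:
    """Classic quantile normalization; this sketch assumes a complete matrix."""
    ranks = df.rank(axis=0, method="first").astype(int).to_numpy() - 1
    mean_dist = np.sort(df.to_numpy(), axis=0).mean(axis=1)
    return pd.DataFrame(mean_dist[ranks], index=df.index, columns=df.columns)

# Toy protein x sample matrix with injected missingness (illustrative values).
rng = np.random.default_rng(1)
mat = pd.DataFrame(rng.normal(20, 2, size=(1000, 12)))
mat[mat < 17] = np.nan

sr66 = reduce_sparsity(mat, 0.66)                # SR66
normalized = median_normalize(sr66)              # one of the four normalization options
# qn = quantile_normalize(sr66.dropna())         # QN on the complete sub-matrix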
To systematically evaluate the performance of each of the above-mentioned parameters, we focused on a sub-dataset representing the two lowest E. coli spike-in conditions. We used bootstrapping to quantify the uncertainty of the observed assessment score and to investigate the effect of sample size on the overall ability of the data analysis workflow to detect differentially abundant proteins. To this end, we randomly drew (with replacement) from the set of samples of the two lowest E. coli spike-in conditions to obtain group sizes of three to 23 samples. On each bootstrap dataset we applied all combinations of the aforementioned sparsity reduction, normalization, and statistical testing options to determine differentially abundant proteins.
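A minimal sketch of this resampling scheme follows; the sample identifiers are placeholders, and the 100 draws per group size are inferred from 2100 bootstrap datasets spread over 21 group sizes.

import numpy as np

rng = np.random.default_rng(42)

# Hypothetical sample identifiers for the two lowest spike-in conditions (n=23 each).
cond_a = [f"ratio25_{i}" for i in range(23)]   # human:E. coli 25:1
cond_b = [f"ratio12_{i}" for i in range(23)]   # human:E. coli 12:1

def draw_bootstrap(group_size: int) -> tuple[list, list]:
    """Draw one bootstrap dataset: `group_size` samples per condition, with replacement."""
    a = list(rng.choice(cond_a, size=group_size, replace=True))
    b = list(rng.choice(cond_b, size=group_size, replace=True))
    return a, b

# Group sizes of 3 to 23; the number of draws per group size is an assumption.
bootstraps = [draw_bootstrap(n) for n in range(3, 24) for _ in range(100)]
print(len(bootstraps))   # 21 group sizes x 100 draws = 2100 bootstrap datasets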
To objectively compare the performance of the different data analysis workflows, we introduced a unifying measure of performance for detecting differentially abundant proteins. The experimental design with the known E. coli spike-in conditions provides ground truth information based on which we can assess true positives (E. coli proteins determined to be significantly differentially abundant between the two spike-in conditions), false positives (human proteins determined to be significantly differentially abundant between the two spike-in conditions), false negatives (E. coli proteins determined to be non-significantly differentially abundant between the two spike-in conditions), and true negatives (human proteins determined to be non-significantly differentially abundant between the two spike-in conditions). By varying the significance threshold over all proteins, we can then plot the true positive rate against the false positive rate to obtain a receiver operating characteristic (ROC) curve. To measure the ability of each workflow to detect differentially abundant proteins, the area under the ROC curve is then determined. We use the partial area under the curve (pAUC) for all analyses, as the pAUC reflects the performance in the relevant range of false positives (Figure 4A) (Walter 2005). While the AUC can be interpreted as the average sensitivity over the whole range of specificities, the pAUC corresponds to the average sensitivity over a relevant (mostly high) specificity range only. In the literature, calculations of pAUCs for different specificities have been used 6,20. Here, we focus on specificities larger than 80% (i.e., false positive rate (FPR) < 20%) for the pAUC calculations.
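A sketch of an unnormalized pAUC computation at FPR < 20% follows; the manuscript's exact implementation (e.g., any standardization of the pAUC) may differ.

import numpy as np
from sklearn.metrics import roc_curve, auc

def partial_auc(y_true: np.ndarray, score: np.ndarray, max_fpr: float = 0.2) -> float:
    """pAUC over FPR in [0, max_fpr]; y_true: 1 = E. coli (true positive class),
    score: e.g. 1 - p-value, so that smaller p-values rank higher."""
    fpr, tpr, _ = roc_curve(y_true, score)
    # Interpolate the ROC curve at max_fpr and integrate up to it.
    stop = np.searchsorted(fpr, max_fpr, side="right")
    x = np.concatenate([fpr[:stop], [max_fpr]])
    y = np.concatenate([tpr[:stop], [np.interp(max_fpr, fpr, tpr)]])
    return auc(x, y)

# Toy usage: p-values for human (0) and E. coli (1) proteins.
rng = np.random.default_rng(3)
p_human = rng.uniform(0, 1, 900)
p_ecoli = np.clip(rng.normal(0.01, 0.02, 100), 1e-6, 1)
y = np.concatenate([np.zeros(900), np.ones(100)])
p = np.concatenate([p_human, p_ecoli])
print(partial_auc(y, 1 - p))   # approaches the maximum of 0.2 for a good workflow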
Although the fold-changes of the spiked-in E. coli proteins are known through our study design, it is unknown which human and E. coli proteins were actually present in the biological sample in the first place. Since the calculations of sensitivities and specificities strongly depend on the definition of the set of proteins present, we calculated them based on three different protein lists. This allows us to evaluate the robustness of the outcomes, while ensuring that no software or library is favoured.
The proteins present in the DIA analysis workflow, i.e. the bootstrap dataset under investigation, are collectively referred to as ‘DIA workflow’ proteins (Supplementary Figures S4 & S5). The proteins identified in at least one of the DIA analysis workflows are referred to as ‘Combined’ (11,516 human proteins, 2,127 E. coli proteins). The proteins identified in more than 80% (at least 14 out of 17) of the DIA analysis workflows are referred to as ‘Intersection’ (4,499 human proteins, 745 E. coli proteins) and represent a core protein set.
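The latter two reference lists can be derived from the per-workflow identification sets with simple set operations; the protein IDs below are placeholders.

from collections import Counter

# Maps each of the 17 DIA analysis workflows to its set of identified protein IDs.
identified = {
    "workflow_01": {"P12345", "P04406", "ECOLI_A"},
    "workflow_02": {"P04406", "ECOLI_A", "ECOLI_B"},
    # ... 15 more workflows in the real analysis
}

counts = Counter(pid for prots in identified.values() for pid in prots)

combined = set(counts)                                    # union: the 'Combined' list
intersection = {p for p, c in counts.items() if c >= 14}  # >= 14/17 workflows: 'Intersection'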
Evaluation of Post-Processing Approaches and Statistics of Differential Abundance
The highest pAUC values are achieved when no sparsity reduction is performed, while stricter criteria for missing values lead to a decrease in pAUC (Figure 4B). However, depending on the measure of performance being used, different aspects of an analysis workflow are rewarded or penalized, representing an underlying challenge in benchmark studies. In our study, the reference protein list based on which the ROC is generated represents an additional factor that impacts the outcome of our comparisons.
The pAUC values shown in Figure 4B have all been generated using the ‘DIA workflow’ protein list. When we removed protein entries based on sparsity reduction in the ‘DIA workflow’, the pAUC values decreased. While the removal of proteins via sparsity reduction can lead to a situation in which the maximum sensitivity cannot be reached, the same can happen if the reference protein list, based on which sensitivity and specificity are calculated, is larger than the list of proteins, for which statistical results are available. This indicates that the choice of an appropriate reference protein list is crucial.
Furthermore, we observed a steep initial increase in the ROC curve for SR90, after which a plateau is reached. Differences in the pAUC are then solely based on the height of this plateau, which itself depends on the number of quantified E. coli proteins in a given dataset. This steepness decreases from SR90 through SR66 to NoSR (data not shown).
If testing for differential abundance of a protein returned a missing value, the p-value for this comparison was set to one. As human proteins are overrepresented in our benchmark dataset, this might lead to a bias when performing sparsity reduction, limiting inter-comparability of sparsity reduction levels.
We next investigated the relative performance of each DIA analysis workflow separated by reference protein list (Figure 4C). Note the tri-modal distribution of pAUC values, which is particularly prominent for DIA-NN and Spectronaut and can be explained by the three included sparsity reduction options. The performance of some DIA data analysis workflows differs drastically between the reference protein lists.
“Within workflow” performance
Using the ‘DIA workflow’ protein lists to measure the prediction performance of differentially abundant proteins, we find that Spectronaut’s ‘DirectDIA’ performs best. DIA-NN, Skyline and Spectronaut all perform well using the more classical DDA spectral libraries generated by MaxQuant and MSFragger. Combining OpenSwath with the MSFragger-based spectral library leads to a better prediction performance than combining it with the MaxQuant spectral library. Overall, the GPF-refined libraries show an inferior performance, except for the refined DIA-NN spectral library in combination with OpenSwath.
“Overall sensitivity” performance of each workflow compared to all other workflows
When using the Combined reference protein list, the GPF-refined libraries, but not the in silico predicted DIA-NN unrefined library, perform well for DIA-NN and OpenSwath workflows. These libraries do not, however, perform as well for Spectronaut. Skyline clearly performs better with the refined PROSIT spectral library as compared to the refined DIA-NN spectral library for this specific reference protein list. Also, the DDA-based spectral libraries perform worse than in the case of the DIA workflow protein list.
Performance against “core protein dataset”
When using the ‘Intersection’ reference protein list, DIA-NN performs, on average, slightly better than the other software solutions. The refined DIA-NN library (also in combination with OpenSwath, but not with Skyline and Spectronaut) leads to a good prediction performance against the core protein set.
The performance of each DIA analysis software suite strongly depends on the spectral library with which it is combined, and on the protein list against which it was benchmarked. For instance, while the relative performance against the DIA workflow protein list was below average for the combination of refined DIA-NN spectral library and DIA-NN analysis, the performance against the Combined protein list and against the Intersection protein list was among the best. In contrast, Skyline in combination with the refined PROSIT spectral library does not perform well when measured against the respective protein list, but performs well when measured against the Combined protein list.
These data highlight the strengths, but also the limitations of any spectral library-DIA software combination. While some combinations lead to a high number of reported proteins, others will give more accurate results. This can also be observed when comparing the detected log2 fold-changes of E. coli proteins with the actual fold-changes of the spiked-in E. coli lysates (Supplementary Figure S6).
We found that virtually no DIA software–spectral library combination benefits from normalization; all perform best with non-normalized data (Supplementary Figure S7). All normalization methods included in this study normalize by distribution and thus act under the assumption of symmetric differential expression, i.e., that the numbers of up- and down-regulated proteins are equal 21. In this benchmark dataset, however, the differentially abundant proteins change in one direction only. Thus, we hypothesize that the observed decline in performance upon normalization could, at least partly, be an artifact of the study design.
Furthermore, due to the higher number of human than E. coli proteins in the samples, the impact of human proteins on the normalized outcome is higher. As a result, the distribution of the human proteins is comparable across the normalized samples, while this is not the case for the E. coli proteins. This might bias the identification of differentially abundant E. coli proteins.
Additionally, all employed normalization methods assume that the relative abundance of proteins within one sample can be used to normalize all proteins. This cannot be assumed for this dataset, as E. coli and human peptides were pipetted separately, which changes the protein abundance ranks between samples (ranks that the normalization methods assume to be stable). This highlights the need for dedicated designs to evaluate normalization strategies in future benchmark studies. A dilution series of the same samples measured with different injection amounts may be more suitable to investigate normalization methods.
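The asymmetry argument can be illustrated with a toy simulation (illustrative values only): shifting only the E. coli fraction upward drags the sample median up, so median normalization subtracts part of the true spike-in difference and pushes the unchanged human background downward.

import numpy as np

rng = np.random.default_rng(7)
human = rng.normal(20, 2, 9000)    # identical human background in both samples
ecoli = rng.normal(18, 2, 1000)

sample_a = np.concatenate([human, ecoli])          # low spike-in
sample_b = np.concatenate([human, ecoli + 1.0])    # one-directional +1 log2 shift

shift = np.median(sample_b) - np.median(sample_a)  # median normalization offset
sample_b_norm = sample_b - shift

print(f"normalization shift: {shift:.3f}")         # > 0 although the background is unchanged
print(f"E. coli change after normalization: {(sample_b_norm[9000:] - sample_a[9000:]).mean():.3f}")
# The asymmetric spike-in inflates the sample median, so normalization shrinks
# the true E. coli change and biases the human proteins downward.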
Finally, we evaluated the prediction performance of statistical tests for two-group comparisons that have previously been used in proteomics data analysis (Figure 4D), again for all three reference protein lists. In general, the non-parametric tests ROTS and SAM consistently perform best for all DIA analysis workflows. Interestingly, SAM performed better than ROTS, with the exception of data analyzed by Skyline, for which ROTS consistently performed best (Supplementary Figure S8). However, the superiority of SAM over ROTS may also be due to the chosen hyperparameters, as the SAM statistic can be derived from the more general ROTS statistic. The good performance of non-parametric methods has been described previously 20,22. Interestingly, SAM did not perform well in the study of Pursiheimo et al, who advise caution when using SAM but highlight the good performance of ROTS 23.
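SAM, ROTS, and limma are R-based methods; as a language-agnostic stand-in, the sketch below runs a per-protein Welch t-test over a proteins x samples matrix and applies the rule described above of setting missing test results to a p-value of one.

import numpy as np
from scipy import stats

def welch_per_protein(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Welch t-test per protein (row); a and b are proteins x samples arrays
    of log-intensities for the two spike-in conditions."""
    _, p = stats.ttest_ind(a, b, axis=1, equal_var=False, nan_policy="omit")
    p = np.asarray(p, dtype=float)
    p[np.isnan(p)] = 1.0   # missing test results are assigned a p-value of one
    return p

# Toy data: 500 proteins, 10 samples per condition; the first 50 rows stand in
# for E. coli proteins with a true one-directional shift.
rng = np.random.default_rng(5)
a = rng.normal(20, 2, size=(500, 10))
b = rng.normal(20, 2, size=(500, 10))
b[:50] += 1.0
pvals = welch_per_protein(a, b)   # feeds the ROC/pAUC evaluation described above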
Connection between data characteristics and statistical prediction performance
We investigated the connection between data properties of the bootstrap datasets and statistical prediction performance. As we identified DIA-NN in combination with the in silico predicted GPF-refined spectral library as an overall well-performing DIA analysis workflow, we further investigated which data properties (Supplementary Figure S9) correlate with benchmarking performance measures (Supplementary Figure S10).
In general, sample variance, kurtosis, skewness, and the ratio of variances between two spike-in conditions show only little influence on the performance of statistical tests to detect differentially abundant proteins. The correlation behavior of the remaining data characteristics differs between the DIA analysis workflows (not shown).
As we included different sample sizes during bootstrapping to mimic limited replicate availability, we were also able to investigate the performance of the different DIA analysis workflows for different sample sizes (Supplementary Figure S11). We observed a moderate positive correlation between pAUC values and sample size for all DIA analysis workflows (shown in Supplementary Figure S12 for DIA-NN in combination with the in silico predicted GPF-refined spectral library as an example).
Influence of sample size on statistical testing and performance
Overall, SAM performed best for all sample sizes across all workflows, except for Skyline, which showed the best performance in combination with ROTS. However, irrespective of the software suite, for small sample sizes (n < 5) limma achieved a performance similar to SAM and ROTS. Van Ooijen et al., who compared different statistical tools to detect differentially abundant proteome features, also found limma to perform well for small sample sizes 24.