Missing values are widespread amongst analyzed samples
The renal cancer dataset RC is more complex than the single-tissue benchmark dataset RCC: it comprises two phenotype classes, has higher individual variability (due to more patients), and has ~3x more data holes (36%). Although many proteins observed in RC and RCC are shared (Figure 1B), a quick check on the distribution of missing values across samples in RC indicates that the data holes are also spread across a wider range of proteins (Figure 1C and Figure 1D).
Since we want to check for missing protein recovery across technical replicates, it is important that batch effects do not dominate the outcome 17. Figures 1C and 1D show the relationships between samples annotated by class and batch (the naming convention is class, sample number, and batch; e.g., N2_2 means “Normal” sample 2, batch 2). We found no obvious batch effects, i.e., the samples do not group broadly by batch label (Figure 1D).
FCS-predicted complexes are tissue specific and biologically relevant
Given FCS, each sample can be represented in terms of its statistically significant networks or protein complexes. But is this representation biologically meaningful?
We first consider the distribution of FCS p-values (calculated on protein complexes) across samples in RCC and in the two sample classes of RC (RC_N and RC_C, where N and C refer to the normal and cancer classes respectively); cf. Figure 3A. Although many significant complexes are shared amongst samples (blue zones), there is considerable disagreement, as represented by the thick mixed-color columns in the middle of the heatmaps. This suggests that different samples report a notable proportion of different complexes even though they belong to the same class (and are expected to report the same complexes as significant).
Despite this apparent heterogeneity amongst same-class samples, we are curious whether there is conserved signal amongst significant complexes (FCS p-value below 0.05) reported in the same tissue type, even across different proteomics screens. Based on the inter-sample agreement for RCC, RC_N and RC_C, we find that the Jaccard indices are relatively high (~0.65 to 0.70) compared to overlaps against significant complexes derived from another tissue (colorectal, CR, in this case); cf. Figure 3B. Although overlaps fall when we consider the similarity of significant complexes between RCC and RC_N (RC_N <-> RCC), the Jaccard indices are still appreciably higher than when we compare RC_N to CR (RC_N <-> CR) and RCC to CR (RCC <-> CR) (Figure 3B). A two-sample t-test shows that the Jaccard indices for RC_N <-> RCC are significantly higher than those for RC_N <-> CR and RCC <-> CR (p-value << 0.01; ***). This means that despite the apparent heterogeneity (in terms of significant complex agreements) amongst same-class samples, there is conserved signal amongst samples derived from the same tissue type, even across different proteomics screens (as with RCC and RC_N).
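The inter-sample agreement used above can be illustrated with a minimal sketch, assuming each sample is represented as the set of complexes with FCS p-value below 0.05. The function names, sample labels and complex identifiers below are hypothetical toy values for illustration only and are not taken from the datasets.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index |A & B| / |A | B| of two sets (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_jaccard(samples):
    """Jaccard index for every unordered pair of samples.

    `samples` maps a sample name to its set of significant complex IDs.
    """
    return {
        (s, t): jaccard(samples[s], samples[t])
        for s, t in combinations(sorted(samples), 2)
    }


# Hypothetical toy data: three samples and their significant complexes.
toy = {
    "N1_1": {"c1", "c2", "c3", "c4"},
    "N1_2": {"c1", "c2", "c3", "c5"},
    "N2_1": {"c1", "c2", "c6"},
}
scores = pairwise_jaccard(toy)
for pair, value in scores.items():
    print(pair, round(value, 2))
```

The resulting per-pair values are the kind of distribution that would then be compared between groups (e.g., same-tissue versus cross-tissue pairs) with a two-sample test.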
For each overlap of significant complexes between a pair of samples, we may determine a significance measure based on the hypergeometric p-value (Figure 3C). Here, regardless of whether we compare the same tissue on the same proteomics screen, the same tissue on different proteomics screens, or different tissues on different proteomics screens, the hypergeometric p-values are all generally low (p-value << 0.01). We speculate this is due to the high number of shared complexes (e.g., housekeeping complexes such as the transcriptional, translational and protein degradation machinery) common to many different tissue types (Supplementary Figure 2). However, it is noteworthy that the p-values for cross-tissue comparisons appear somewhat less significant, possibly due to lower inter-tissue overlaps (Figure 3C).
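As a minimal sketch of how such an overlap significance could be computed, the upper tail of the hypergeometric distribution gives the probability of seeing at least the observed number of shared complexes by chance. The parameterization below (a universe of all tested complexes, with the two samples' significant sets treated as draws from it) is an assumption for illustration; the exact formulation used here is the one given in Materials and Methods.

```python
from math import comb


def hypergeom_overlap_pvalue(universe, size_a, size_b, overlap):
    """P(X >= overlap) for a hypergeometric overlap.

    universe: total number of complexes tested (assumed background)
    size_a:   number of significant complexes in sample A
    size_b:   number of significant complexes in sample B
    overlap:  number of significant complexes shared by A and B
    """
    total = comb(universe, size_b)
    upper = min(size_a, size_b)
    tail = sum(
        comb(size_a, i) * comb(universe - size_a, size_b - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 10 complexes tested, 5 significant in each sample,
# all 5 shared. Only one of the C(10, 5) = 252 possible draws is that extreme.
p = hypergeom_overlap_pvalue(universe=10, size_a=5, size_b=5, overlap=5)
print(p)
```

Exact integer arithmetic via `math.comb` avoids floating-point underflow in the individual terms, at the cost of speed on very large universes.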
The proteins corresponding to significant complexes unique to liver and kidney may be tissue discriminatory: Based on the Fragments Per Kilobase of transcript per Million mapped reads (FPKM) normalized transcriptome profile across 14 different tissues (Human BodyMap 2.0; http://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-513/) 18, we examined the expression of genes corresponding to proteins from significant complexes common to RC_N and CR, proteins from significant complexes unique to RC_N, and proteins from significant complexes unique to CR. The genes are clustered by hierarchical clustering (Euclidean distance; average linkage). When examining shared genes (that code for proteins belonging to common complexes), kidney and liver are closely spaced amongst the various tissue types, but when considering unique genes (that code for proteins belonging to tissue-specific complexes), the liver and kidney tissues are more widely spaced apart (Supplementary Figure 2). This observation, together with the earlier observation that significant complexes are conserved with respect to tissue type, allows us to infer that FCS makes biologically relevant predictions, in line with the biological characteristics of the tissue class being examined.
FCS-based cross-examination of technical replicates yields modest recovery of missing proteins
Via FCS, we may determine the extent and significance of recovery using three verification strategies: against the set of proteins corresponding to all significant PSMs in the same sample (Figure 4A), against the set of proteins corresponding to all significant PSMs in the cross-batch replicate (Figure 4B), and against the union of these two sets (Figure 4C). In the notation of Figure 4, e.g., N T1 -> N T2, N stands for normal and T for technical replicate, and the direction of the arrow means that we take the proteins recovered based on the significant complexes from sample N T1 and check them against the proteins identified in N T2. We consider each sample (from patient samples 1 to 6) separately. The results in each cell of Figure 4 are shown as two rows: the top row shows the overlap |r|/|R| and its associated p-value on the left and right respectively (see Materials and Methods). The bottom row shows the total number of predicted missing proteins and the number of verified missing proteins on the left and right respectively.
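The three verification strategies can be sketched as simple set operations. This is a minimal illustration that assumes R is the set of predicted missing proteins (members of significant complexes that are absent from the observed protein list) and r is the subset of R supported by the chosen PSM-derived protein list; the precise definitions are those in Materials and Methods. All function names and protein labels below are hypothetical.

```python
def predict_missing(significant_complexes, observed):
    """Members of significant complexes that were not observed (assumed R)."""
    members = set().union(*significant_complexes)
    return members - set(observed)


def recovery(predicted, evidence):
    """Return the verified subset r and the overlap |r| / |R|."""
    verified = predicted & set(evidence)
    rate = len(verified) / len(predicted) if predicted else 0.0
    return verified, rate


# Hypothetical toy data for one sample and its cross-batch replicate.
complexes = [{"A", "B", "C"}, {"C", "D", "E"}]  # significant complexes
observed = {"A", "C"}                            # finalized protein list
psm_self = {"A", "B", "C"}                       # proteins with PSMs, same sample
psm_cross = {"A", "D"}                           # proteins with PSMs, replicate

predicted = predict_missing(complexes, observed)
strategies = {
    "self": psm_self,
    "cross-batch": psm_cross,
    "union": psm_self | psm_cross,
}
rates = {name: recovery(predicted, ev)[1] for name, ev in strategies.items()}
print(sorted(predicted), rates)
```

Because the union list is a superset of both individual lists, its recovery rate can never be lower than either of the other two.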
As additional comparisons, we also verify against the observed proteins (i.e., the finalized set of proteins reported in the proteomics screen for a given sample) in the cross-batch replicate (Supplementary Figure 3A) and against the proteins from significant complexes in the cross-batch replicate (Supplementary Figure 3B). It is useful to discuss these two naïve scenarios first. In the former, recovery is extremely low: not all recoveries are statistically significant, and the verification rate is around 2 to 5% (Supplementary Figure 3A). In the latter, where we compare missing proteins predicted to be present in one replicate against the FCS-significant complexes in the corresponding cross-batch replicate, the overlap rises dramatically to ~90% (Supplementary Figure 3B). Although cross-batch replicates do not report the same protein sets, these proteins nonetheless generally map back to the same protein complexes in the same sample. However, neither of these verification methods is robust. In the former, the verification rate is too low to be useful. This is not surprising; otherwise, taking multiple technical replicates would have easily resolved MPP (thus obviating the need for research in this area). Unfortunately, these data tell us that missing proteins tend to be harder to observe or recover in general (see next section). In the latter scenario, we focus on direct verification of significant protein complexes across batches, and not on mutually supportive predictions of missing-but-present proteins. But this comparison, naïve as it is, is still useful, as it tells us that despite the different proteins reported by technical replicates of the same sample, we nonetheless predict similar complexes.
Although it is gratifying that some biological signal is evidently conserved, this does not change the fact that replicates from different samples still report many different significant complexes, which may not be meaningful (Figure 3A).
For verification of predictions of missing-but-present proteins, the PSM list (where proteins with at least one representative peptide are listed) is used to determine whether there is evidence that a predicted missing protein is indeed present. Interestingly, despite the differences in observed proteins, self-recovery and cross-batch replicate recovery give similar recovery rates of ~20% (Figures 4A and 4B), although the cross-batch replicate recovery rates are slightly higher.
Taking the union of the PSM lists from self and cross-batch replicate increases verification rates modestly from ~20% to ~25%; cf. Figure 4C. Although this is a relative improvement of 25% (i.e., 25 − 20 over 20), verification rates are still low. Where RC is concerned, most predicted missing proteins (~75%) cannot be verified in this manner due to the lack of any supporting PSMs. However, we think there may be a silver lining. In particular, we expect that given more technical replicates and more support, it is possible to improve recovery beyond 25% (as rarer PSMs become observable), although we cannot say by how much, whether the recovery proportion can become predictable as a function of the number of replicates, or whether a recovery proportion estimated on one dataset generalizes to other tissues or datasets.
Peptide support is a stronger contributing component towards missing proteins than low abundance
Low abundance is frequently cited as a cause for MPP 19. The reasoning stems from the semi-stochastic loss of proteins in Data-Dependent Acquisition (DDA) proteomics screens, where smaller signals corresponding to low abundance are more likely to be overlooked. However, low abundance cannot be regarded as a strong or sole contributing factor for the missing proteins observed in the RC dataset, where missing proteins exist even at higher abundance levels (Supplementary Figure 1A). The observation that missing proteins are also frequent at high abundance levels is surprising but was also reported by Webb-Robertson et al. in 2015 19. Moreover, for relatively high-abundance proteins (greater than the median expression level), there does not appear to be any difference in abundance between proteins below and above the median missing-value level (Supplementary Figure 1A). Hence, an alternative explanation is needed to better understand why missing proteins occur.
In the Data Independent Acquisition (DIA) paradigm, there is no semi-stochastic preselection of precursor peptides based on signal intensity; all spectra are captured if they fall within the detection limit. The lack of association between low-abundance proteins and increased missing values (Supplementary Figure 1A) is consistent with the nature of DIA, and the commonly cited association is perhaps an artifact of the older DDA paradigm (higher-intensity precursor spectra tend to be selected for identification, creating the correlation between low abundance and non-detection). Instead, in DIA, we find that low-confidence PSMs and low peptide support for proteins are generally stronger contributing factors towards MPP. Figure 4D shows the distributions of peptide support for Internal (observed proteins), Recovered (verified proteins) and External (proteins that were neither observed nor predicted to be missing in the cross-batch replicate). Observed proteins tend to have the highest peptide support, while predicted missing proteins (expected to be present) have relatively lower peptide support. However, unpredicted proteins not observed in the cross-batch replicate have the least peptide support. It is plausible that proteins with lower peptide support may not consistently meet the statistical threshold required when converting PSMs (based on peptides) to the finalized observed protein list, and this leads to MPP (and data holes in the observed protein expression matrix).
The results (Figure 4D) also boost credibility for complex-based missing protein prediction, since the recovered proteins based on significant complexes are more enriched for higher peptide support than those not predicted to be recoverable at all.
Unverified predicted missing proteins may not exist in the tissue in the first place
Without prior knowledge on all protein complex families in CORUM, it is difficult to conclude that highly overlapping but significant complexes are contributing to a good number of non-verifiable proteins. While CORUM is a manually curated database concerned with the annotation of biologically relevant complexes, it does not provide a convenient way of ascertaining which complexes belong to the same major family and have tissue-specific properties. For example, the nBAF and npBAF complexes have many similar components but are found in different tissues 20. In that regard, condensing complexes based on shared components also does not give rise to biologically coherent entities 21.
In the absence of tissue-specific information that would allow us to consider only kidney-specific complexes, we devised a simple check. Since most proteins at the smallest FCS p-values are considered important, we reasoned that tissue specificity (of complexes) may contribute towards some degree of non-verification (i.e., we may be considering irrelevant complexes that are significant only because they share many core proteins with a relevant tissue-specific complex). This test is important: if the hypothesis holds, it means we are severely underestimating the recovery rates based on networks because of tissue-specificity issues.
Using sample N1 and the peptide list derived from both of its technical replicates, we have a total of 62 unverified proteins and 557 (observed + verified) proteins. We mapped each of the 62 unverified proteins to the largest complex of which it is a component and computed the overlap with the observed + verified proteins, obtaining a median overlap of 0.32. Out of 1,000 randomized median overlaps, only 7 were greater than the observed median. Thus, the empirical p-value is 0.007, indicating strong support for enrichment of observed + verified proteins in the complexes where the unverified proteins are found. Running the same test on sample N2 gives a similar result, with a p-value of 0.008.
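The empirical p-value reported above is the fraction of randomized medians that exceed the observed median (7 of 1,000 gives 0.007). The following is a minimal sketch of such a test. The randomization scheme shown (drawing a random protein set of the same size as the observed + verified set from a background universe) is an assumption made for illustration and may differ from the scheme actually used; all names are hypothetical.

```python
import random
from statistics import median


def median_overlap(complexes, present):
    """Median fraction of each complex's members found in `present`."""
    return median(len(c & present) / len(c) for c in complexes)


def empirical_pvalue(observed, randomized):
    """Fraction of randomized statistics strictly greater than the observed."""
    return sum(r > observed for r in randomized) / len(randomized)


def randomization_test(complexes, present, universe, n_iter=1000, seed=0):
    """Compare the observed median overlap with medians from random sets.

    Assumed null model: a random protein set of size len(present) drawn
    without replacement from `universe`.
    """
    rng = random.Random(seed)
    pool = sorted(universe)
    observed = median_overlap(complexes, present)
    randomized = [
        median_overlap(complexes, set(rng.sample(pool, len(present))))
        for _ in range(n_iter)
    ]
    return observed, empirical_pvalue(observed, randomized)


# The arithmetic of the reported result: 7 of 1,000 randomized medians
# exceed the observed median of 0.32.
p_reported = empirical_pvalue(0.32, [0.5] * 7 + [0.1] * 993)
print(p_reported)
```

Counting only strictly greater randomized values follows the wording above; a more conservative variant would count ties as well and add one to both numerator and denominator.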
Hence, there is some evidence that these unverified proteins belong to some tissue-specific complex variant absent in the tissue sample. But this evidence may also be a bit circular in the sense that the FCS p-value of a protein is correlated with a high fraction of the complex’s member proteins being present as well.
This tells us that incorporating all complexes simultaneously without regard for their tissue specificity or the presence of other same complex family members is giving rise to a large proportion of unverified proteins. It also suggests that we may be underestimating the verification rates of our predicted missing proteins as we are predicting proteins that should not be in the tissue in the first place. This finding suggests that perhaps more work should be put into building tissue-specific complexomes for more powerful network analytics.