Plasma proteomic profiling of pan-cancer patients discovers biomarkers of cancers


 Circulatory system proteins play a central role in human physiology; therefore, profiling plasma proteins has been explored widely as a robust and dynamic tool in studying diseases. Here, we investigated the proteomic profile of plasma in 1,118 pan-cancer patients. The integrative analysis uncovered diverse molecular characteristics of different cancer types and bridged the plausible connection with the clinical features, including tumor stages. The findings demonstrate that the balance between lipid and glucose metabolisms is important in regulating immune infiltration. The pre- and post-surgery comparison of the plasma proteome indicated that tumor-induced angiogenesis is overexpressed in the tumor, and the differences between the proteome patterns could monitor post-surgery therapeutic effects. Finally, we developed a panel of tumor-type-specific proteins to classify tumor types with >95% sensitivity/specificity. Collectively, this study portrayed the plasma proteome landscape of human cancers and screened reliable biomarkers that could assist in diagnostic and drug discovery efforts.

The pan-cancer plasma proteome analysis was performed by high-resolution liquid chromatography-mass spectrometry (LC-MS/MS) using an orbitrap Fusion Lumos mass spectrometer. The stability of the mass spectrometry (MS) platform was assessed by quality control (QC) runs. A spearman's correlation coefficient was calculated for 50 quality control (QC) runs using HEK293T cell samples ( Figure S1C). The average correlation coefficient of the QC samples was 0.91 (range, 0.84-0.97). Processed data are shown in Table S1 (raw data available via Data Resources in Methods). The number of proteins (at a peptide-and protein-level false discovery rate (FDR) of 1%) in each sample ranged from 1,155 to 4,247 ( Figure 1C) and was significantly higher in tumor sample plasma (median, 2,381) than that in NTDP (median, 2,356) (t-test, p-value < 0.0001, Figure S1D). In total, 10,946 proteins were detected in the 1,318 samples, among which 9,308 proteins were common between tumor-and NTDP samples, whereas 1,535 and 103 proteins were specific in tumor-derived plasma (TDP) and NTDP, respectively ( Figure 1D; Table S1). On average, about 2,100 proteins were measured in each tumor type ( Figure 1C; Table S1), and 4,267 proteins detected at least in one sample were present in all 19 tumor types ( Figure 1E). In this study, the individual plasma proteomes were characterized by their relative abundance, not by the simple presence or absence of proteins. Proteome quantification was conducted using the intensity-based absolute quantification (iBAQ) algorithm, followed by the fraction of total (FOT) normalization as reported previously. The dynamic range of proteins detected spanned eight orders of magnitude ( Figure 1F). Principal component analysis (PCA) revealed a clear distinction among 18 tumor-derived plasma proteomes, indicating that protein variation between tumor types exceeds the variation between individuals (Figure 1G).
Searching the identified proteins (10,946) in several databases (Table S1) revealed several functional groups, including 3,237 drug targets, 406 transcription factors (TFs), 45 cytokines, 2,322 enzymes, 356 kinases, 66 growth factors, 32 hormone activities, and 92 G protein receptors (Figures 1H, 1I, S1C, and S1D). The distributions of these proteins in 18 tumor types are shown in Figures S1E and S1F. Collectively, these data, indicating sufficient coverage for analysis of biological processes, represent a comprehensive plasma proteomic resource for the tumor study.

Proteomic classification revealed heterogeneity of pan-cancer
To investigate the intrinsic structure of the tumor plasma proteomic data, non-negative matrix factorization (NMF)-based unsupervised clustering was performed on the most variable proteins (proteins with the top 3,000 coefficients of variation (CV)). The six NMF clusters (C1-C6) were identified (Figures S2A and S2B), comprising 96, 490, 66, 80, 133, and 253 cases.
To further explore the biology associated with the proteomic taxonomy, we performed consensus clustering using the ConsensusClusterPlus (CCP) package based on differentially expressed proteins (DEPs) (Kruskal-Wallis, adjusted p-value < 0.05, We then accessed the enrichment of the clinical annotations in different clusters and found the significant connections between particular cluster types and clinical information. As shown in Figure 2B, the Gastrointestinal-rich cluster was dominated in TNM stage IV, indicating poor clinical outcomes. Furthermore, STAD and CRCA, the highly enriched tumor types in this cluster, were all annotated as TNM stage IV types (Chi-square p-value = 0.05, Figures 2C and S2D). Consistently, the TNM stage IV of STAD and CRCA were remarkably enriched in Gastrointestinal-rich cluster than in other clusters ( Figure 2D, Chi square p-value = 0.03), indicating that the Gastrointestinal-rich cluster represented a terminal stage of the gastrointestinal tumor patients.
Gene set enrichment analysis (GSEA) 14 for GO biological processes using the signal to noise value between the STAD/CRCA in the Gastrointestinal-rich cluster and other clusters revealed acute phase response (APR) was upregulated in the Gastrointestinalrich cluster (Normalized enrichment score = 2.39, FDR = 0, Figure 2E). Notably, several APR markers such as CRP, SAA1, and SERPINA1 had the highest expression level in the Gastrointestinal-rich cluster (Ranksums p-value<0.01, Figures 2F and 2G). CRP is the most important acute phase reactant. In an attempt to explore the effects of CRP in STAD/CRCA, we performed pair-wise Spearman correlation on CRP and clinical annotations, which revealed a significantly higher correlation with neutrophil percentage (NE%) ( Figure 2H). Moreover, NE% was significantly higher in the Gastrointestinalrich cluster than in other clusters (Ranksums p-value<0.01, Figure 2I). To evaluate the role of neutrophils in tumor metastasis, we applied the Spearman correlation algorithm to NE% and the proteome data in all samples and discovered neutrophil-related metastasis proteins [15][16][17][18] (Figure 2J). The results revealed that the metastasis signature proteins S100A8 and S100A9 were significantly correlated with NE% (Spearman rho>0.3). Besides, we also found that the S100A8/S100A9 were overrepresented in the Gastrointestinal-rich cluster than in other clusters (Figures 2K, 2L, and S2E), indicating S100A8/S100A9 play key roles in the metastasis of the STAD/CRCA. These observations suggested that the CRP-promoted neutrophils recruitment can contribute to metastasis in the Gastrointestinal-rich cluster by increasing the expression level of S100A8/S100A9 and leading to a worse prognosis ( Figure 2M). Figure S3A, the Urinary I-rich cluster was characterized by erythrocyte-related clinical annotations. HGB was increased in this cluster (Kruskal-Wallis, p<0.0001, Figure 3A). In this cluster, BLCA, but not LIHC, showed a significantly high level of HGB than other tumor types (Figures 3B and S3B), which was higher than that in other clusters (Ranksums p-value<0.01, Figure 3C). Interestingly, all BLCA patients classified in the Urinary I-rich cluster were male (Figures 3D and   S2D), and the HGB level was the highest in the male population in the Urinary I-rich cluster than the male population in the other clusters and the female populations (Ranksums p-value<0.05) ( Figure 3E).

As shown in
The GSEA analysis on the BLCA male population between Urinary I-rich and other clusters showed that heme metabolism and HPMP were enriched in the Urinary I-rich cluster than in other clusters ( Figure 3F). Major proteins participating in these processes, such as BLVRB, CA1, CAST, and PRDX2, were upregulated in the Urinary I-rich cluster ( Figure 3G). Further, we calculated Spearman correlation based on the HGB clinical values and process ssGSEA scores to investigate the relation between HGB and the above two biological processes. The results indicated that HGB was significantly correlated with heme metabolism (Spearman rho = 0.419, p-value = 2.18E-5), and the latter was strongly correlated with HPMP (Spearman rho = 0.592, p-value = 8.93E-11, Figures 3H   and 3I), which was consistent with a previous report 19,20 . Here, we estimated Multi-Gene Proliferation Scores (MGPS) to assess the influence of HPMP on cell proliferation, which revealed a negative association between HPMP and MGPS (Spearman rho = −0.314, pvalue = 1.47E-3, Figure 3J) and BLCA patients in Urinary-I rich cluster had a lower MGPS level than in other clusters (Ranksums p-value = 0.03, Figure 3K). Consistently, we found more early T stages of BLCA patients in the Urinary I-rich cluster ( Figure   S3C). Subsequently, to investigate the relation between the HGB and the prognosis of the BLCA, we referred to TCGA BLCA data for survival analysis. Significantly, the level of HBA2 was positively related to better prognosis in males but not in females ( Figures   S3D and S3E). These findings uncovered the linkage between HGB and BLCA ( Figure   3L) and suggested that HGB could be considered a prognostic biomarker for male BLCA patients.
The Urinary II-rich cluster, which has the highest creatinine (CREA) level, consisted of the highest number of Urinary cancers (Kruskal Willis, p = 4.05E-12, Figures 3M and   2A). The elevated level of CREA, a break-down product of creatine, signifies impaired kidney function or kidney disease. To investigate how CREA affects the carcinoma process, we used correlation tests between CREA and Gene Set Variation Analysis (GSVA) scores. As a result, in Hallmark gene sets, we identified pancreas beta cell pathway ranked the top showing strong correlation with CREA (Spearman rho = 0.197, FDR = 2.00E-5, Figure 3N) and was overrepresented in the Urinary II-rich cluster ( Figure S3I), which is consistent with the report that high level of creatine induced insulin secretion 21 . In addition, insulin receptor signaling pathway (INSR, PDK4, RGN, SLC1A2, etc.) that showed higher association with pancreas beta cell pathway (Spearman rho = 0.199, p-value = 2.028E-11, Figure 3O) was activated in this cluster (Kruskal-Wallis p-value = 2.038E-25, Figure S3F). Further analysis revealed that glucose import (Spearman rho = 0.330, FDR = 1.10E-27) / glucose metabolic process (Spearman rho = 0.184, FDR = 6.37E-9) have strong correlation with the insulin receptor signaling pathway ( Figure 3P) and were activated in Urinary II-rich cluster ( Figure S3I).
Insulin-regulated facilitative glucose transporter, GLUT4, HNF1A, and ASPSCR1 were all elevated in the Urinary II-rich cluster (FDR<0.01, Figure 3Q), especially in RCC. In addition, we found lymphocyte count (LY) was elevated in this cluster (Kruskal-Wallis p-value = 3.513E-10, Figure S3G). Therefore, we hypothesized an imbalance between glucose metabolic process and immune infiltration. To confirm this hypothesis, we observed the correlation between the glucose metabolic process and LY (Spearman rho = 0.241, p = 7.659E-14, Figure S3H). Above all, these results highlighted the abnormal level of CREA that leads to the interplay between metabolic reprogramming and immune function linked by insulin signaling (Figure 3R).

Plasma proteome based hierarchical clustering of pan-cancer
To investigate TDP proteome features in different physiological systems, we classified 15 tumor types into six physiological systems, including excretory system (BRCA, CHOL, LIHC, PAAD, and THCA), reproductive system (CECA, OVCA, and TGCT), urinary system (BLCA and RCC), digestive system (CRCA, ESCA, and STAD), immune system (ML) and respiratory system (LC). Consequently, a total of 1,323 (Kruskal-Willis test, p-value < 0.05, the expression of the proteins in a certain system was >2 folds higher than the other, consistently) system-specific plasma proteins were identified, ranging from 94 TDPs in the respiratory system to 347 TDPs in the digestive system. (Figure 4A; Table S3). To reveal the functional characteristics of the system-specific plasma proteins, we performed the Kyoto Encyclopedia of Genes and Genomes (KEGG) function enrichment analysis. The results demonstrated that the excretory system-enriched plasma proteins were significantly enriched in pathways (p-value < 0.05), including mitochondrial translation (such as MRPS35, GFM2, and MRPL1), energy metabolism (such as EP300, TFAM, and RXRA) and endocytic vesicle (such as SNX3, PLD4 and PLD1). The plasma proteins highly expressed in the digestive system were mainly involved in pathways related to linoleate metabolism (such as PLA2G4A, CYP3A5, and UGT1A7), ribosome (such as MRPS7, RPL10, and RPL18), androgen and estrogen biosynthesis and metabolism (such as CYP3A5 and COPE) and proteoglycan biosynthesis (EXT1 and COPE) ( Figure 4B). Collectively, we found the specific system plasma proteins were significantly enriched in the predominant biological processes of the corresponding systems.
Intriguingly, the specific system plasma proteins were also correlated with the TNM stage. We identified 60 specific system plasma proteins based on their significant association with the TNM stage (Kruskal-Wallis test, p-value < 0.05, correlation > 0.2).
For example, IGFBP5, KRT13, KRT15, ORM2, and PSIP1 were the digestive systemspecific plasma proteins, whereas VCAM1 and HLA-B were significantly upregulated in the immune system. There were 11 (C4BPA, CFB, C9, SERPINA1, SERPINA3, CRP, HSPB1, ITH3, PSMB10, RCN1, and CP) proteins, which were correlated with the TNM stages in the urinary system ( Figure 4C). In addition, there were 20, 12, and 10 proteins, which were correlated with the TNM stages in the excretory system, respiratory system and reproductive system, respectively (Figures S4A and S4C). These findings suggest that identifying specific system plasma proteins associated with TNM staging could disclose the mechanism in the tumor progression and nominate the biomarker to identify the tumor stages.
The hierarchical clustering showed that patient samples from the same physiological system tended to cluster together, indicating more similarity in tissue-originated tumor type ( Figure 4D). Consequently, we identified four clusters (subtypes) based on the correlation between the 19 tumor types. A total of 1,124 (Kruskal-Willis, p-value < 0.05, the expression of the proteins in the certain subtypes was > 2-folds higher than the other, consistently) most variable proteins, ranging from a minimum of 205 proteins in cluster 2 to a maximum of 455 proteins in cluster 4 were identified. The clusters revealed a clear co-clustering among the samples from the GI tract (colorectal, stomach, and esophageal) and the urinary organs (renal and bladder) ( Figure 4D, Table S3). Furthermore, KEGG analysis of these 1,124 differentially expressed proteins revealed that cluster 1 included To further elucidate the underlying differential characteristic of these four clusters, we investigated their association with clinical indices. Serum total bile acid (TBA) level, a sensitive marker of liver function, is used to diagnose and monitor various liver diseases 22 . In our dataset, the TBA level was overrepresented in cluster 3 (Kruskal Willis P-value = 2.2e-06, Figure S5A), and the LIHC in this cluster showed the highest TBA level (t-test P-value = 0.0009, Figure S5B).
To illustrate the mechanism of the high level of TBA in LIHC, we performed a Spearman correlation algorithm on TBA and global proteome data and ORA on  Figure 4F), and TGCT showed an increased ALB level than other tumor types in this cluster (Kruskal-Wallis p-value < 0.0001, Figure 4G). Serum ALB levels have been shown to have prognostic significance in cancers 23 . To illustrate the reason for higher-level ALB in TGCT, we firstly examined the pathways related to ALB biosynthesis. The branched-chain amino acid metabolism has been reported to promote ALB biosynthesis 24 ; however, we did not find the upregulation of this pathway in the TGCT, indicating the high level of ALB might not be because of the superfluous biosynthesis ( Figure S5H). Further, to unravel the possibilities of another speculation that the blood-testis barrier (BTB) can influence the transport of ALB and tight junction plays an important role in BTB formation 25 , we performed the Spearman correlation analysis between ALB levels and pathway GSVA scores. The results demonstrated a strong correlation of the adhesion-related pathways with ALB ( Figure 4H), and the cell junction assembly was consistently activated in TGCT (Kruskal-Wallis p-value = 1.7E-7, Figure 4I). RAC signaling was reported as the upstream of cell junction assembly 26 .
We also found a significant positive correlation between cell junction assembly and RAC protein signal transduction in our cohort (r = 0.24, p-value = 2.7E-16, Figure 4J).
Moreover, the cell junction assembly related proteins, including TRIP6, ARVCF, S100A10, STMN1, RAB29, NEBL, CNN2 were also found to be strongly correlated with RAC protein signal transduction (Figure 4K), and these proteins showed the highest level in TGCT than in other tumor types in cluster 4 ( Figure 4L and S5G). Collectively, we inferred that the RAC signaling might hyper-stimulate the BTB formation by increasing the activation of cell junction assembly and thus, preventing the ALB transport in TGCT ( Figure 4M).

Identification of specific tumor-derived plasma proteins
Several blood proteins, such as CEA (colorectal cancer and pancreatic carcinoma) and CA125 (ovarian cancer), have been reported for tumor detection. To explore the role of these conventional blood markers, seven blood markers (CEA, NSE, CA153, CA125, PAP, SCCA1, and SCCA2) were examined in our cohort ( Figure 5A). Specifically, compared to others tumor types, OVCA had the highest expression of CA125, which was consistent with the previous study 27 . The most commonly increased plasma protein markers associated with cancer were CEA (BLCA and PAAD), NSE (RCC, BLCA, ESCA, CECA, and PAAD), CA153 (BLCA, STAD, OVCA, and CECA), PSAP (ML, BLCA, OVCA, LIHC, and CHOL) and SCCA1 and SCCA2 (LC, RCC, and THCA) ( Figure 5B). However, we found that some biomarkers were also highly overrepresented in other cancer types besides the targeting cancer type, such as CA153 in BLCA and STAD. These results indicated the limitation of a single biomarker in identifying specific tumor types. Therefore, we focused on expanding the spectrum of novel blood-derived biomarkers.
To identify the TDP markers, we stratified the proteins into two categories: nonubiquitous proteins (non-ubiPros, 3,107, accounting for 58.8% of the identified proteins), which were highly expressed in only a few tumor types with a transformed median expression value of < 0.5; ubiquitous proteins (ubiPros, 2,179, 41.2%), which were expressed in a wide variety of tumors with a transformed median expression value of > 0.5 ( Figure S6A). We identified 3,107 tumor-specific plasma proteins, ranging from 49 proteins in LC to 330 proteins in CHOL. We then performed KEGG analysis of the tumor-specific plasma proteins and found consistency between the specific plasma proteins with the relative feature of tumor type ( Figure 5C). For example, HIF signaling pathways (MAP2K1, EIF4E) and peroxisome (ABCD1, ECH1, and GNPAT) were enriched in RCC tumors, indicating a significant Warburg effect in RCC tumor, which enables tumor cells to rapidly proliferate and survive under nutrient depletion and hypoxia 28 .
To further connect specific co-expression of the protein groups with different tumor types, we performed weight gene correlation network analysis (WGCNA) 29 ( Figure 5D).
Separating the proteomic profiles to eigengene modules identified a group of modules positively associated with different tumor types (p-value < 0.05, r > 0.92). WGCNA identified 18 protein modules, and the number of proteins in different modules ranged from 49 proteins in module grey60 related to LC to 330 proteins in module turquoise related to CHOL. Moreover, the findings demonstrated an extensive connection between modules with tumor types. For instance, module purple (MAP3K9, NUDCD2, ATL1, ERBB4, etc.) was highly correlated to BLCA (p-value < 1.1E10, r = 0.95). The proteins AQP2, CLCA4, RXRA, MRPL13 comprising module red were strongly associated with PAAD. Further, based on the cut-off criteria (GS > 0.2, correlation > 0.8), we identified the proteins with high connectivity in the significant module as hub proteins, which were also found to be commonly highly expressed in the corresponding specific tumor type.
For example, phosphodiesterase [ENPP3], which has been shown to be involved in tumor cell migration 30,31 , was identified as a hub protein and highly expressed in RCC.
Moreover, SLC27A1, one of the three members of the fatty acid transport protein family, was overrepresented in LIHC. As fatty acids are fundamentally required for tumor cell proliferation to provide new phospholipids for plasma membranes, the increase of the SLC27A1 might suggest a demand for the uptake of exogenous fatty acids. Consequently, the inhibition of exogenous fatty acids uptake might provide novel therapeutic approaches to treat this pernicious tumor type. We classified the top 10 functional hub proteins in the modules with the highest connectivity as candidate markers for further analysis (Top 10 proteins in Figure 5E; complete list in Table S4).
Interestingly, we found some of these hub proteins were associated with the TNM stage. For example, in RCC samples, there were 20 potential biomarkers (Kruskal−Wallis test, p-value < 0.05) positively associated with the TNM stage ( Figure 5F). Gene Ontology (GO) analysis demonstrated that most of these proteins were involved in immune response (SERPINA1/3, ORM1/2, CFI, C2, C9, and LRG1). We next identified 12 potential biomarkers negatively associated with the TNM stage in RCC samples, which were mostly involved in blood microparticles (AFM, ITIH2, and GSN), endomembrane system (ALB, LUM and SERPINA4), and plasma lipoprotein particle (APOM, APOA1) ( Figure 5G). These findings suggest that identifying the potential markers associated with the TNM staging could guide physicians to select appropriate therapeutic schedules.
We reasoned that ideal biomarkers should be overexpressed in the corresponding tumor tissue (upregulated in the tumor tissue sample) and released into the blood  Table S4. Consequently, collating these data, 585 potential biomarkers in eight tumor types were identified-for instance, 153 potential biomarkers in LC were identified ( Figure 5I). Of note, these proteins are involved in several processes, including cell cycle, spliceosome, and transcription-coupled nucleotide excision repair ( Figure 5J). In addition to the potential plasma biomarkers overrepresented in tumor samples, we also identified proteins that were exclusively detected in TDP and generated a list of proteins detected in > 50% of TDP samples but not in NTDP samples. Using a cut-off of > 5-fold upregulation of proteins (t-test, p-value < 0.05) in the tumor tissues than the matched adjacent tissue proteome, 254 core marker proteins were determined totally. For example, 94 core markers out of 153 potential markers were identified in the LC, including EIF3D, MCM6, and RFC5 ( Figure 5K; Table S4) Collectively, these core biomarkers indicated heterogeneity among different tumor types and could be used to discriminate cancer types.

Immune infiltration in pan-cancer tumors
To gain insight into immune features in pan-cancer, we performed cell type deconvolution analysis using xCell 32 based on proteome data to infer the relative abundance of different cell types in the tumor microenvironment ( Figure 6A; Table S5) and used ESTIMATE 33 to infer the immune score of tumor samples (Figures 6A and   6C).  Table S5). Comparing this with tumor types, we observed lower immune infiltration in ML, CRCA and LC and a higher immune infiltration in RCC and BLCA (Figures 6A and 6C).  (Figures 6A and 6B).
The IC3 group was characterized by a high degree of various types of immune cells such as CD8 + Tem, B-cells and CD8 + T-cells, containing BLCA (n=121) and RCC (n=130) plasma samples (Figures 6A and 6B). Moreover, compared with other tumors, the IC3 cluster displayed the highest immune score and upregulation of MAPK activation, PI3K-AKT signaling in cancer and downregulation of lipids metabolism (e.g., SUMF2, ACOX2, PLPP3, CD36) (Figures 6A, 6C, and S9C). This is consistent with recent reports that tumors with high levels of immune infiltration are characterized by lower action of lipids metabolism 34 (Spearman's correlation, r = −0.26, p-value < 2.20E-16) ( Figure 6D). The highest overlap (67%) was observed between the IC3 group and the Urinary II-rich cluster (Figure 6E), and the results agree with those shown in Figures   S3G and S3H. Furthermore, in the IC3 group, we found a higher activation of glucose metabolism (e.g., GSK3B, PGP, PHKG2) (Kruskal−Wallis test, p-value < 0.05) ( Figure   6G), and a significant positive correlation between the pathway scores of glucose metabolism and scores of immune infiltration (Spearman's correlation, r = 0.14, p = 1.70E-06) ( Figure 6F). These results revealed the facilitative and inhibitory effects of glucose metabolism and lipid metabolism on immune infiltration, respectively ( Figure   6H).
IC5, as the group of LC enrichment (n=233), was characterized by upregulation of peroxisomal lipid metabolism, bile acid and bile salt metabolism, and steroid hormone biosynthesis, which showed a high degree enrichment of megakaryocytes and astrocytes ( Figure 6A). In patients with LC, venous thromboembolism (VTE) is a recognized complication and one of the main causes of death 35 . Importantly, we found that megakaryocytes, the precursors of platelets, showed the most enrichment levels in IC5 ( Figure 6I) and exhibited a positively significant correlation with platelet-associated proteins (e.g., SELP, CALM) (Figures 6J and 6K). Identically, IC5 exhibited the highest platelet (PLT) level ( Figure 6L) than other ICs. Corresponding with the previous studies, we found that PLT was positively correlated with Fibrinogen (FIB) and elevated level of FIB in IC5 consistently (Figures 6M and 6N). These findings are consistent with those of the recent reports that show under abnormal FIB, the blood will continue to coagulate, leading to thrombosis 36 . Taken together, the megakaryocyte level associated with diverse clinical features could reveal the molecular mechanism influenced on thrombus production in LC plasma ( Figure 6O).

Short-term changes of the plasma proteome after surgery
Surgery, the operation to remove part of the body to treat cancer, provides a chance to cure many tumors. The perioperative period is characterized by a recovery from cancer status, wherein the side effects and indirect symptoms also affect the physiological status of the body. However, the impact of surgery on the plasma of patients remains unclear.
To clarify the changes, we investigated short-term changes in the plasma proteome during the perioperative period. In our cohort, a total of 101 plasma samples were collected during the perioperative period, including first visit (FV, n=17), pre-operative (Pre-op, n=49), and post-operative (Post-op, n=35). Clinical characteristics, surgery information and plasma proteome data are presented in Table S6.  (Figures 7A, 7B, and 7C). Importantly, we found the heat shock protein family was significantly altered after surgery, including HSPA1A, HSPA1B, HSPA5, and HSPA9. Among these proteins, HSPA1A (p = 0.043), HSPA1B (p = 0.043) and HSPA5 (p = 0.033) were significantly down-regulated after surgery, while the HSPA9 (p=0.0028) was significantly up-regulated ( Figure 7D). Consistent with the previous reports 37, 38 , HSP1A and HSPA9 were identified as unfavorable and favorable prognostic markers, respectively, in colorectal cancer (p-value <0.001).
To discover the markers that indicate the therapeutic effect of the surgery, we focused on the proteins that were identified in the FV and pre-op but dramatically downregulated in post-op. According to these criteria, 193 significant DEPs were identified (Kruskal-Wallis p-value <0.05, Figure 7E). By matching these with the 1,399 TDP specific proteins, 32 were overlapped (Figure 7F). These 32 proteins demonstrated the potential to indicate the detection of tumor or the indication of the surgery recovery status ( Figure   7G). We then accessed the correlation network of these 32 proteins; as shown in Figure   7H, the CDK5 has the largest number of significantly correlated proteins, suggesting the involvement of the CDK5 in carcinogenesis. Further, we investigated the protein expression of CDK5 in plasma proteome derived from healthy and tumor patients and found CDK5 was significantly up-regulated in esophageal cancer (p = 1.768e-4), ovarian cancer (p = 1.919e-5), and pancreatic cancer (p = 3.707e-3) (Figure 7I).
Since CDK5 is a kinase that plays an important role in regulating the nervous system and cell cycle, we investigated the expression of the substrates proteins of CDK5 and identified that four (APP, DPYSL2, EPRS1, PAK1) of 50 substrates were significantly correlated to CDK5 (p-value < 0.01, Figures 7J, 7K, S10A, S10B, and S10C). Moreover, pathway enrichment analysis of the correlated substrates of CDK5 revealed that vascular endothelial growth factor and its receptor (VEGFR2) mediated vascular permeability was significantly enriched (SRC, RAC1, PAK, CTNNB1) ( Figure 7L). VEGF is a key mediator of tumor angiogenesis and a major target for anti-angiogenic therapy for various malignant tumors. These results suggested tumor-induced angiogenesis was overexpressed in the tumor, and its diminishment could be monitored after the surgery ( Figure 7M).

Construction of tumor diagnosis classifier based on proteome data
In this study, using the proteome data of normal plasma as the baseline, we built a reference interval (RI) with the distribution of normal plasma protein abundance, which could be used for abnormal detection if the protein abundance exceeds the upper limit of the RI. For any query sample, an outlier is defined as a protein whose expression level is We classified fifteen tumor types into six physiological systems, and 1,360 system plasma proteins were identified. Intriguingly, the pathway enrichment analysis revealed that the specific plasma proteins were significantly enriched in the predominant biological processes of corresponding systems. Then, we used hierarchical clustering to divide the 19 tumor types into four subgroups based on their correlations. The results showed that samples from the same physiological system clustered together, indicating more similarity in tissue-originated tumor type. The integrated analysis of the clinical indices associated with proteome subgroups is one of the highlights of this study. In our data, plasma TBA level was the highest in cluster 3, and a significant positive correlation between TBA and interleukin signaling was observed. Interestingly, the proteins APOB, CD36, SFTPA1, JAK1, and SDC1 related to interleukin signaling were up-regulated in cluster 3. Hence, we inferred that the metabolic disorder of TBA might be due to the inflammatory response caused by the abnormal activation of the interleukin pathway in the liver. Besides traditional biomarkers, such as SCCA, CEA, and NSE, we also identified 18 protein modules associated with different tumor types, including 616 potential biomarkers in eight tumor types, of which 116 proteins were exclusively detected in the TDP but never found in the healthy samples. Even more exciting, we identified some proteins associated with the TNM stage, such as CRP, C9, and CFB in the urinary system. These proteins associated with the TNM stage may guide physicians to select appropriate therapeutic schedules. In summary, our study further fortifies the understanding of cancer biology, revealing new biomarkers and biomarker panels, and opens a new avenue for more efficient early diagnosis and surveillance of cancer.
In recent years, studies of immunotherapy combined with immune checkpoint inhibitors (ICI) have made a great breakthrough, improving the treatment options for various cancers and increasing the survival rate of treated patients, although heterogeneous response rates were obtained for treated patients with different cancer types and patients affected by a specific tumor type 40  We constructed the prediction model to classify the tumor patients and normal control based on the plasma proteome. We built the reference interval of normal people, trained the linear classifier to predict the tumor types based on the plasma proteome.
Taken together, our findings revealed that the plasma proteome could reflect the diverse molecular mechanisms of different tumor types. The plasma proteins could be used as biomarkers for early-stage cancer detection and treatment response.

Conclusions
In summary, this study extends our biological understanding of plasma-derived proteins of pan-cancer and generates therapeutic hypotheses that may serve as the basis for future preclinical studies and clinical trials toward molecularly guided precision treatment of pan cancer. Meanwhile, we have made the primary and processed datasets available in publicly accessible data repositories and portals, which will allow full investigation of this extensively characterized cohort by broader scientific communities.

Plasma protein extraction and trypsin digestion
The top 14 highest abundant plasma proteins were first depleted from plasma samples before protein extraction using an immunodepleting kit (Thermo Fisher) according to the manufacturer's instructions and then inactivated at 85°C for 10 min. The depleted plasma was digested by trypsin at an enzyme to protein mass ratio of 1:25 overnight at 37°C, and the peptides were then extracted and dried (SpeedVac, Eppendorf).

FFPE Protein extraction and tryptic digestion
Slides ( Precursor ion score charges were limited to +2, +3, and +4. The data were also searched against a decoy database so that protein identifications were accepted at a false discovery rate (FDR) of 1%. The results of DIA data were combined into spectra libraries using SpectraST software. A total of 327 libraries were used as reference spectra libraries.

Mass spectrometry platform QC
The 293T cell (National Infrastructure Cell Line Resource) lysate was measured every three days as the quality-control standard for the quality control of the performance of mass spectrometry. The quality-control standard was digested and analyzed using the same method and conditions as the tumor FFPE samples. A pairwise Spearman's correlation coefficient was calculated for all quality-control runs in the statistical analysis environment R v3.6.3. The results are shown in Figure S1. The average correlation coefficient among the standards was 0.91, and the minimum and maximum correlation coefficients were 0.84 and 0.97, respectively. These quality-control samples demonstrated the consistent stability of the mass spectrometry platform.

Preprocessing of DIA proteomic data and batch correction
Before performing any downstream analysis, we applied type-specific batch correction on global proteome abundance to remove the technical difference. A batch correction was performed for each data type on the subset of proteins with more than 20% observed in at least one subtype. After filtering proteins with missing rates of >20% in each tumor type, there were 5286 proteins with an overall missing rate of 40.9% to 61.5% for the diagnosis-wide global protein abundance datasets, respectively. As the first step to batch correction, we performed KNN imputation separately on the data from each tumor type using the "impute.knn" function from the "impute" R package. After merging the data across tumor type, we then applied the R tool Combat, with the tumor type as a covariate to remove batch effects 45 .

Unsupervised proteome clustering using NMF
We used non-negative matrix factorization (NMF) implemented in the NMF R-package 46 to perform unsupervised clustering of tumor samples and identify proteome features showing characteristic expression patterns for each cluster. Briefly, given a factorization rank k (where k is the number of clusters), NMF decomposes a p x n data matrix V into two matrices W and H such that multiplication of W and H approximates V. Matrix H is a k x n matrix whose entries represent weights for each sample (1 to N) to contribute to each cluster (1 to k), whereas matrix W is a p x k matrix representing weights for each feature (1 to p) to contribute to each cluster (1 to k). Matrix H was used to assign samples to clusters by choosing the k with the maximum score in each column of H. The resulting matrix was then subjected to NMF analysis leveraging the NMF R-package. To determine the optimal factorization rank k (number of clusters) for the proteome data matrix, we tested a range of clusters between k = 2 and 10. For each k, we factorized matrix V using 50 iterations with random initializations of W and H. To determine the optimal factorization rank, we calculated cophenetic correlation coefficients measuring how well the intrinsic structure of the data was recapitulated after clustering and chose the k with maximal cophenetic correlation for cluster numbers between k = 2 and 10.
After determining the optimal factorization rank k, we repeated the NMF analysis using 200 iterations for clustering.

The tumor type distribution of NMF clusters
To determine the distribution of different tumor types in each NMF clusters, we calculated the tumor type enrichment score (TTES) using the following formula: represents the TTES of specific tumor type i in the j NMF cluster. Specifically, !" , , ! , # represents the count of tumor type i in the j NMF cluster samples, the count of j NMF cluster samples, the count of i tumor type samples, and all tumor samples, respectively.

Pathway analysis based on NMF clusters
To better determine the intrinsic proteomic clusters and pathway activation, unsupervised clustering was performed. Based on NMF cluster proteome data, features (proteins) were filtered according to the Kruskal-Wallis test, finally, 3326 proteins with Benjamin and Hochberg adjusted p-value<0.05 across 1118 samples were selected for clustering.
Consensus clustering was performed using the R package ConsensusClusterPlus 47 .
Before clustering, the data matrix was scaled so that each protein had a mean 0 and standard variation 1 across samples and calculated the mean value of each protein based on the NMF cluster. Then, K-means clustering based on the Euclidean distance matrix was conducted across 300 repetitions for cluster numbers ranging from 6 to 20, and other parameters were used as default. To determine the optimal protein groups, we performed an overrepresentation of biological pathway/gene sets in each proteome group using the clusterProfiler R package 48 (Table S2).

Gene set Score for Single Sample
To functional characterized NMF cluster results by single sample Gene Set Enrichment Analysis (ssGSEA), we calculated the normalized enrichment score of each sample based on four class of gene sets: GOBP, KEGG, Hallmark and Reactome gene sets. We utilized R package GSVA with following parameters: min.sz = 10, max.sz = 300 and other parameters were used default.

Multi-Gene Proliferation Scores (MGPS)
MGPS were calculated from the median-MAD normalized proteome data. Briefly, MGPS was calculated as the mean expression level of all cell cycle-regulated genes identified by Whitfield et al. 54 in each sample.

Immune Scores
The immune score was inferred using the R package ESTIMATE v1.0.11 33 . Although the ESTIMATE algorithm was designed to analyze transcriptome data, some studies have used it for proteome analysis 55,56 . The results indicate the feasibility of evaluating the engagement of each subtype of immune cells.

Immune subtype identification
The abundance of 64 different cell types was computed via xCell based on proteomic profiles. Therefore, for this analysis, 1118 tumor plasma samples with proteome data were utilized. Consensus clustering was performed based on Raw enrichment scores of the patients. Based on these 64 signatures, consensus clustering was performed to identify groups of samples with similar immune/stromal characteristics. Consensus clustering was performed using the R packages ConsensusClusterPlus based on z-score normalized signatures. Specifically, 80% of the original pan-cancer tumor plasma samples were randomly subsampled without replacement and partitioned into three major clusters using the Partitioning Around Medoids (PAM) algorithm, which was repeated 2000 times.

Correlation between tumor types and clinical features
To estimate the correlations between tumor types and clinical features, Fisher's exact test was used on categorical variables, and Wilcoxon rank-sum test was used on continuous variables.

Hierarchical clustering
The Hclust function implemented in the R language was used to perform unsupervised clustering of Pan-cancer samples to identify proteomic features of each cluster. Before clustering, we selected proteins as follows: 1) Within each tumor type, proteins were

Screening potential drug targets of pan-cancer samples
Drug targets approved by the FDA or under clinical trials were retrieved from the Drugbank database (version 5.1.5) (http://www.drugbank.ca/). Target proteins that were unregulated in pan-cancer with potential curative drugs were chosen.

Classifier construction
The classifier was implemented using

Consent for publication
All authors give consent for the publication of the manuscript in Molecular Cancer.  See also Figure S1and Table S1. See also Figure S2 and Table S2. See also Figure S3 and Table S2. See also Figure S4 and Table S3 See also Figures S5 and S6 and Table S4. See also Figure S7 and Table S5. See also Figure S8 and Table S6.      A-C. The identification of specific-system plasma proteins associated with TNM staging in excretory system, reproductive system, respiratory system, respectively (Kruskal-Wallis test, p < 0.05).