The highlight of this study is that we integrated five types of features including protein complexes, protein stoichiometric ratios, pathways, network degrees, and proteins, rather than using purely individual proteins, to build machine learning models for disease classification. Twenty-five predictive markers were identified to stratify COVID-19. Our work demonstrates that integrating protein expression levels with protein context improves COVID-19 patient stratification.
The 25 features highlighted by our analysis are all associated with the pathogenesis of COVID-19. As shown in Figure 4, after SARS-CoV-2 enters the alveoli, macrophages phagocytose the virus and release cytokines, resulting in the release of acute phase proteins (APPs) from the liver 20. These APPs stimulate the complement system response 21. However, in severe cases, the complement system reacts abnormally, which can potentially trigger a cytokine storm 22,23. On one hand, the cytokine storm leads to multi-organ damage, such as damage to the liver and testis 24. On the other hand, more macrophages are recruited from the peripheral blood to the lungs, causing alveolar macrophage infiltration, lung damage, and respiratory failure 25.
Several studies have reported predictive blood markers for severe cases, such as ITIH4 26,27, M-CSF, CCL3 and CCL4 28, as well as CMAs 29. Studies utilizing MS-based proteomics have also found that proteins associated with the complement system, acute phase protein response, inflammation system, macrophage dysregulation, antibody response, and coagulation system are altered in severe COVID-19 cases 3, 6, 7, 16, 30, findings that have also been confirmed by other proteomic approaches 31-33. In this study, we found that one complex, two pathways, seven proteins and one network degree are involved in the complement system, acute phase proteins and inflammation, including “SAA1, SAA2, YLPM1”, “complement activation”, “acute-phase response”, IGHG3, SAA1, SAA2, IGLV1-47, C9, ITIH4, C4BPA and IGHV3-73 (Figure S1). In addition, one pathway (phagocytosis, engulfment) associated with macrophage dysregulation was identified as a key feature. Our data uncovered previously hidden COVID-19-associated proteome context information.
Our study also identified other molecular features in severe patients. Several transport proteins were upregulated. Vitamin D-binding protein (GC) enhances the activity of C5a in the complement system 34, which may induce cytokine storms. MyRIP, another transport protein, participates in melanosome function and in delivering melanin pigment to the skin 35. The upregulation of MyRIP may be related to skin hyperpigmentation in severe patients 36. Transthyretin (TTR) is a marker for inflammation and a negative acute-phase reactant. Reduced TTR has been reported to be associated with the acute-phase response induced by inflammation, and TTR is also a malnutrition marker, suggesting nutritional disorders in severe cases 37. The TTR-RBP complex consists of TTR and retinol-binding protein 4 (RBP4), and upregulation of the TTR-RBP complex suggests an improved inflammation state 38. In this study, both TTR and the TTR-RBP complex decreased in the sera of severe cases, suggesting a more intense acute-phase response and inflammatory state.
Notably, some proteins associated with the complement system were also altered in severe cases. An abnormal response of the complement system can trigger a cytokine storm, which can lead to progression to severe disease 22, 39. Carvelli et al. found that C5 was the main effector of the abnormal complement system, and that blockade of C5 could prevent excessive lung inflammation 40. Complement protein C3 was also associated with fatal outcome of COVID-19 23. In contrast to previous studies, we found that C9, another protein in the complement system, was elevated in severe cases, suggesting that it may also be a marker or potential therapeutic target. In addition, GC, which enhances the activity of C5a 34, was also upregulated. C4BPA, which is associated with C4 activity, was abnormally expressed 41 (Figure 4). In addition to changes in proteins associated with the complement system, RPIA was downregulated in severe cases, which may indicate impaired glucose metabolism and liver damage. The Tudor domain-containing protein 1 (TDRD1), which plays a central role in spermatogenesis 42, was also downregulated, which may contribute to the impaired testicular function observed in severe cases 24.
In addition to proteins, other types of protein context features shed further light on the mechanism of severe COVID-19. The cytolysis pathway is induced after viral infection and serves as a clearance mechanism for infected cells 43. The alteration of the phosphatidylcholine binding pathway may contribute to the inflammatory process 44. The increased ratio of SAA2/YLPM1 in the "SAA2, SAA1, YLPM1" complex in severe cases may be due to upregulation of SAA2 (serum amyloid A-2 protein) and downregulation of YLPM1 (YLP motif-containing protein 1, Figure S1), indicating an acute-phase response and enhanced repair of inflammation-induced telomere shortening 45. The network degree changes of some proteins were associated with cytokine storm. Immunoglobulin heavy variable 3-73 (IGHV3-73) participates in antigen recognition 46. MTTP stimulates phosphatidylcholine transport 47. Alpha-2-macroglobulin (A2M) influences cytokine signaling 48, and SIRT7 suppresses inflammation 49. Since network degree reflects co-expression associations with other proteins, the network degree changes of these proteins also uncovered systematic molecular changes in severe cases. Our study showed that the model built on all five feature types outperformed models built on a single feature type (Figure 3B), suggesting the benefits of integrating multiple types of protein context in disease prediction and stratification.
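To make the two less familiar feature types concrete, the following is a minimal sketch of how a stoichiometric ratio and a network degree could be computed. It is not the pipeline used in this study: the use of Pearson correlation, the 0.8 threshold, the function names, and the toy intensity values are all assumptions made for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists of intensities."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def stoichiometric_ratio(sample, numerator, denominator):
    """Ratio of two member proteins of one complex within a single sample."""
    return sample[numerator] / sample[denominator]


def network_degrees(expression, threshold=0.8):
    """For each protein, count partners with |r| >= threshold (assumed rule)."""
    degrees = {protein: 0 for protein in expression}
    for a, b in combinations(expression, 2):
        if abs(pearson(expression[a], expression[b])) >= threshold:
            degrees[a] += 1
            degrees[b] += 1
    return degrees


# Toy intensities across four samples (made-up numbers, not study data).
toy_expression = {
    "SAA1": [1.0, 2.0, 3.0, 4.0],
    "SAA2": [2.0, 4.0, 6.0, 8.0],
    "YLPM1": [4.0, 3.0, 2.0, 1.0],
    "A2M": [1.0, 3.0, 1.0, 3.0],
}
toy_sample = {"SAA2": 8.0, "YLPM1": 2.0}

ratio = stoichiometric_ratio(toy_sample, "SAA2", "YLPM1")
degrees = network_degrees(toy_expression)
```

In this toy example the three complex members are strongly co-expressed (positively or negatively) and therefore each have two partners, whereas A2M has none. A change in degree between mild and severe groups would be computed by running `network_degrees` separately on each group.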
Some limitations of this study should be noted. There were missing features in the two test sets. Seven features were not included in the TMT data, and nine features were excluded from the German cohort data. The median value of all available features was used to impute these missing features. The sample size of the training set is also limited. Nevertheless, the model achieved satisfactory AUCs in these independent tests. Neither of these limitations compromises the major conclusion of this study that integrating protein context information improves COVID-19 severity classification. Moreover, the protein complex information was obtained from cellular complexes, meaning that not all of the complexes necessarily form in the serum, which needs to be verified by future research. Finally, building ratios may introduce a risk of overfitting, but this risk can be reduced by building models that combine ratios with other types of features.
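The imputation step described above can be sketched as follows. This is one reading of the text, namely that each missing feature in a sample is filled with the median of the features that were measured in that same sample; the function name, the feature list, and the values are hypothetical and are not taken from the study.

```python
from statistics import median


def impute_missing(sample, required_features):
    """Fill features missing from a sample with the median of its measured ones.

    Assumption: the median is taken over the required features that have a
    value in this sample, not over the same feature across other samples.
    """
    measured = [
        sample[f] for f in required_features if sample.get(f) is not None
    ]
    fill_value = median(measured)
    return {
        f: sample[f] if sample.get(f) is not None else fill_value
        for f in required_features
    }


# Hypothetical sample in which one model feature (GC) was not quantified.
model_features = ["C9", "ITIH4", "TTR", "GC"]
test_sample = {"C9": 3.0, "ITIH4": 1.0, "TTR": 2.0}

imputed = impute_missing(test_sample, model_features)
```

Because the fill value carries no information about the missing feature itself, a model applied to imputed samples relies on the remaining measured features, which is consistent with treating the missing features as a limitation.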