Study design and patient enrollment
The overall design of this study is shown in Figure 1. The discovery cohort was a retrospective cohort. We selected 38 patients who had undergone a physical examination conducted at the People's Liberation Army (PLA) General Hospital from April to July 2015. The inclusion criteria were: 1) Patients who were asymptomatic or had stable symptoms within one year after onset of acute coronary syndromes; 2) Patients who were stable more than one year after initial diagnosis or revascularization regardless of symptoms; 3) Patients with angina pectoris, suspected vasospasm, or microvascular disease; and 4) Asymptomatic patients screened for CHD. The exclusion criteria were: 1) Patients with severe heart failure; 2) Patients with CHD in the acute phase; and 3) Patients with other diseases that made them unsuitable for this study. According to whether MACEs (including cardiovascular related death, non-fatal myocardial infarction, unstable angina, and heart failure) occurred until June 2019, the participants were divided into a CCS events group (Group A, n=19) as the cases and a non-events group (Group B, n=19) as the controls for the proteomics analysis. The validation cohort was an independent prospective cohort that included 352 patients who were recruited from those who had undergone a routine physical examination at the PLA General Hospital from April to July 2017. The inclusion and exclusion criteria were the same as those for the discovery cohort. This cohort was followed up until April 2021. This study was approved by the Ethics Board of the Chinese PLA General Hospital and written informed consent was obtained from each patient.This study conforms to the principles outlined in the Declaration of Helsinki.
Blood Sampling
Blood samples were collected after fasting 12 hours. The samples were stored at −80°C with ethylenediaminetetraacetate until analysis.
Proteomics analysis
Quantitative proteomics analysis was performed using liquid chromatography tandem mass spectrometry (LC-MS/MS) to identify potential protein biomarker candidates among those proteins differing in abundance between event group and non-event group.
We used an integrated approach involving DIA strategy, HPLC fractionation to quantify the dynamic changes of the whole proteome.
Protein Extraction
Firstly, the cellular debris of plasma sample was removed by centrifugation at 12,000 g at 4 °C for 10 min. Then, the supernatant was transferred to a new centrifuge tube. The top 12 high abundance proteins were removed by Pierce™ Top 12 Abundant Protein Depletion Spin Columns Kit (Thermo Fisher). Finally, the protein concentration was determined with BCA kit according to the manufacturer’s instructions.
Trypsin Digestion
For digestion, the protein solution was reduced with 5 mM dithiothreitol for 30 min at 56 °C and alkylated with 11 mM iodoacetamide for 15 min at room temperature in darkness. The protein sample was then diluted by adding 100 mM TEAB to urea concentration less than 2M. Finally, trypsin was added at 1:50 trypsin-to-protein mass ratio for the first digestion overnight and 1:100 trypsin-to-protein mass ratio for a second 4 h-digestion.
HPLC Fractionation
The tryptic peptides were fractionated into fractions by high pH reverse-phase HPLC using Agilent 300Extend C18 column (5 μm particles, 4.6 mm ID, 250 mm length). Briefly, peptides were first separated with a gradient of 8% to 32% acetonitrile (pH 9.0) over 60 min into 60 fractions. Then, the peptides were combined into 18 fractions and dried by vacuum centrifuging.
Data-independent Acquisition (DIA)—LC-MS/MS Analysis
The iRT kit was added to all the samples according to manufacturer’s instructions. The tryptic peptides were dissolved in solvent A (0.1% formic acid, 2% acetonitrile), directly loaded onto a home-made reversed-phase analytical column (25-cm length, 100 μm i.d.). Peptides were separated with a gradient from 4% to 32% solvent B (0.1% formic acid in 90% acetonitrile) over 114 min, and climbing to 80% in 3 min then holding at 80% for the last 3 min, all at a constant flowrate of 450 nL/min on an EASY-nLC 1200 UPLC system (Thermo Fisher Scientific). The separated peptides were analyzed in DDA mode by Q ExactiveTM HF-X (Thermo Fisher Scientific) with a nano-electrospray ion source.
The separated peptides were analyzed in Q ExactiveTM HF-X (Thermo Fisher Scientific) with a nano-electrospray ion source. The full MS scan resolution was set to 120,000 for a scan range of 385–1200 m/z. The data acquisition was performed in DIA mode. Each cycle contains one full scan followed by 70 DIA MS/MS scans with predefined precursor m/z range. The HCD fragmentation was performed at a normalized collision energy (NCE) of 27%. The fragments were detected in the Orbitrap at a resolution of 15,000. Fixed first mass was set as 200 m/z. Automatic gain control (AGC) target was set at 5E5.
Data Analysis
Spectral library generation: The resulting DDA data were processed using MaxQuant search engine (v.1.6.6.0). Tandem mass spectra were searched against the human SwissProt database (20387 entries) concatenated with reverse decoy database. Trypsin/P was specified as cleavage enzyme allowing up to 2 missing cleavages. The mass tolerance for precursor ions was set as 20 ppm in First search and 4.5 ppm in Main search, and the mass tolerance for fragment ions was set as 0.02 Da. Carbamidomethyl on Cys was specified as fixed modification. Acetylation on protein N-terminal and oxidation on Met were specified as variable modifications. FDR was adjusted to < 1%. The false discovery rates of the PSMs and proteins were set to less than 1%.
DIA data analysis: All DIA data were analyzed in Skyline (v 4.1.0). The DDA search results were imported to Skyline to generate the spectral library, and the retention times were aligned to iRT reference values. Transition settings: precursor charges were set as 2, 3, 4, 5, ion charges were set as 1, 2. The ion match tolerance was set as 0.02 Da. Six most intense fragment ions from the spectral library were selected for each precursor. Decoy generation was based on shuffled sequences, and the FDR was estimated with the mProphet approach and set to 1%. Relative quantification of proteins was performed using MSstats package.
Bioinformatics analysis
For the Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) analyses, we used the two-tailed Fisher’s exact test to determine the significance of the functional enrichment of the DEPs against all the identified proteins. A corrected p-value <0.05 was considered significant. Gene Set Enrichment Analysis (GSEA) is an aggregate score and running-sum statistic approach for molecular signature-based statistical significance testing[11]. The entire gene set containing a ranked list of all expression values in a dataset is considered without requiring a cutoff of differentially expressed values for functional analysis. For the GSEA, we used the entire proteomics abundance profiles dataset. Gene sets from the Molecular Signatures Database (MSigDB) v7.2 were used (H: hallmark gene sets; C2: KEGG pathway database; C5: GO terms database).
Machine learning-based selection of biomarkers
Construction of voting classifier
We used three machine learning classification algorithms, Logistic Regression, Support Vector Machine, and Random Forest, as the base classifiers (Supplementary figure S1). On the basis of these classifiers, we built a voting classifier. When a new sample had to be assigned to a category, each base classifier was used to predict the probability that the new sample belonged to a particular category. The final classification result was determined by the weighted value of the predicted probability of each category by all three base classifiers. This is an integrated method that used the Voting Classifier model in Python.
Feature ranking
Each sample was represented by a feature vector composed of numerous expression data. To quantify the ability of these expression features to distinguish different samples, we performed univariate feature analysis using a variance test to calculate the correlation between each feature and the sample category one by one. In this way, the ability of each feature to distinguish the sample category was obtained and the score and the corresponding p-value was calculated. The expression features were sorted according to the calculated p-value, and used in the subsequent analysis.
Accuracy evaluation index
To compare the difference between the category predicted by the model and the actual sample category, we calculated an accuracy index using the Matthews coefficient value as an indicator of the accuracy of the predictive power of the model as
Matthews coefficient=(TP×TN−FP×FN)/√((TP+FP)(TP+FN)(TN+FN)(TN+FP))
where, TP, TN, FP, and FN are true positive, true negative, false positive, and false negative, respectively.
Feature selection
We plotted the calculated accuracy index results against the number of features in the expression feature subset as the incremental feature selection curve. When the Matthews coefficient reached the maximum value, we considered the expression feature subset corresponding to the model as the optimal expression feature subset.
Cross-validation
The expression feature subset and sample category were used as input of the Voting Classifier model, and 10-fold cross-validation was used to calculate the prediction accuracy of the local optimal expression feature subset for a sample. The results were expressed by the receiver operating characteristic curve (ROC). This is a dynamic validation that reduced the impact of data partitioning[12].
Statistics analysis
Data are presented as numbers and frequencies for categorical variables and as means ± standard deviation for continuous variables. Baseline characteristics were compared using the chi-square test for categorical variables and analysis of variance test for continuous variables. The effect of the candidate biomarkers was evaluated using a Cox proportional hazards model, and p-values were calculated using the log-rank test. Kaplan-Meier plots were used to compare the cumulative risk. The ROC was compared using a z test (DeLong’s method[13]) between the classic Framingham CHD risk model alone and the Framingham CHD model combined with the candidate biomarkers EPCR and CETP. The Framingham CHD model included age, sex, total cholesterol, high-density lipoprotein cholesterol (HDL-C), systolic blood pressure, current smoking, and diabetes status as confounding factors[14].
A two-tailed p-value of <0.05 was considered to indicate statistically significant difference for all the analyses. The statistical analyses were performed using SPSS (version 23.0), STATA (version 12.0), and MedCalc Software.