1. Platelet collection for BC diagnostics and data processing
Circulating platelets have a distinctive capacity for RNA splicing, which makes their spliced RNAs a valuable source of biomarkers. We focused on identifying the most relevant and distinctive spliced RNAs that differ between BC and non-BC cases. To this end, we systematically collected blood samples from 499 subjects who had not undergone tumor resection surgery, together with two post-mastectomy blood specimens. All individuals in the study cohort underwent rigorous clinical diagnostics, including comprehensive pathological analyses. Before sequencing, we excluded samples with significant genomic contamination or severe degradation. The final analysis included 280 samples: 183 BC cases, 95 benign breast disease cases, and 2 post-mastectomy samples. The benign breast diseases covered a range of non-BC conditions, namely benign breast tumors (n = 44), breast inflammation (n = 17), and sclerosing adenosis (n = 34). Although they represent different diseases and severity levels, these conditions collectively form the symptomatic control group in our study. Additionally, we incorporated platelet RNA-sequencing data from the study by Best et al. [10], available in the GEO database. This dataset comprises 217 samples from healthy female individuals, referred to as the asymptomatic control group because they had no reported signs of cancer or other severe illnesses, and 93 samples from patients with BC.
Following the approach described by Best et al., we implemented a quality control procedure that identified 5111 high-confidence transcripts (>30 reads in more than 10% of the samples). Three platelet samples showing low inter-sample correlation (<0.5) with all the others were excluded. Additionally, we confirmed that more than 1500 transcripts were detected in each sample, resulting in a total cohort of 587 samples (see Supplementary Fig. 1, Supplementary Table 3) [12]. Data processing was performed to mitigate batch effects both between and within the two data sources. Figure 1 shows the study design. Within the pre-surgery sample dataset (n = 585), the groups accounted for 47.18% (BC, n = 276), 36.58% (asymptomatic controls, n = 214), and 16.24% (symptomatic controls, n = 95) of the samples. These proportions were used to create the internal training, validation, and testing cohorts. Notably, tumor stage and molecular subtype remained unknown for some patients. Table 1 shows the detailed baseline characteristics of the participants.
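The three quality control criteria described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names are hypothetical, a transcript is assumed to count as "detected" in a sample when its read count is above zero, and the inter-sample correlation of a sample is assumed to be summarized as the median Pearson correlation of its raw counts with every other sample. The toy matrix uses scaled-down thresholds so that the behavior is visible.

```python
from math import sqrt
from statistics import median

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0

def keep_transcripts(counts, min_reads=30, min_fraction=0.10):
    """counts[t][s] is the read count of transcript t in sample s.
    Return indices of transcripts with > min_reads reads in > min_fraction of samples."""
    n_samples = len(counts[0])
    return [t for t, row in enumerate(counts)
            if sum(c > min_reads for c in row) > min_fraction * n_samples]

def keep_samples(counts, min_detected=1500, min_corr=0.5):
    """Return indices of samples with > min_detected detected transcripts and a
    median correlation with the other samples of at least min_corr.
    Both summaries are assumptions made for this sketch."""
    n_samples = len(counts[0])
    cols = [[row[s] for row in counts] for s in range(n_samples)]
    kept = []
    for s, col in enumerate(cols):
        detected = sum(c > 0 for c in col)
        corr = median(pearson(col, cols[o]) for o in range(n_samples) if o != s)
        if detected > min_detected and corr >= min_corr:
            kept.append(s)
    return kept

# Toy matrix: 4 transcripts (rows) x 4 samples (columns); the last sample is an outlier.
toy = [[40, 50, 45, 0],
       [100, 120, 110, 5],
       [0, 1, 0, 90],
       [35, 0, 0, 0]]
print(keep_transcripts(toy, min_fraction=0.5))  # transcripts passing the read filter
print(keep_samples(toy, min_detected=1))        # samples passing both sample filters
```

With the thresholds reported in the text, the defaults (`min_reads=30`, `min_fraction=0.10`, `min_detected=1500`, `min_corr=0.5`) would be used unchanged.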
2. Differential platelet RNA profiles in BC and healthy individuals
A comparison of BC cases and healthy controls in the training dataset revealed substantial changes in the TEP transcriptome. Of the 5111 RNAs assessed, 1181 were upregulated and 1207 were downregulated in BC cases (log CPM ≥ 3, P < 0.001; Figure 2a). Additionally, the number of high-confidence RNAs detected in BC samples (median: 5001) exceeded that in healthy females (median: 4791; P = 1.06e-07). An unsupervised hierarchical clustering heatmap generated after the differential analysis (Figure 2b) showed a distinct difference in TEP-RNA profiles between patients with BC and healthy donors.
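The split into upregulated and downregulated RNAs can be sketched as a simple filter over a differential-expression table. The snippet below is only an illustration under stated assumptions: each RNA is assumed to have the fields `logFC`, `logCPM`, and `PValue` (the column names used in edgeR result tables), direction is assumed to follow the sign of `logFC` for BC versus controls, and the RNA names and values are made up.

```python
def split_by_direction(results, min_log_cpm=3.0, max_p=0.001):
    """Return (up, down) lists of RNA names that pass the abundance and P-value cutoffs."""
    up, down = [], []
    for name, row in results.items():
        if row["logCPM"] >= min_log_cpm and row["PValue"] < max_p:
            if row["logFC"] > 0:
                up.append(name)      # higher in BC (assumed contrast direction)
            elif row["logFC"] < 0:
                down.append(name)    # lower in BC
    return up, down

# Toy table with hypothetical RNA names and values.
toy_results = {
    "RNA_A": {"logFC": 1.8, "logCPM": 5.2, "PValue": 1e-6},
    "RNA_B": {"logFC": -2.1, "logCPM": 4.0, "PValue": 5e-5},
    "RNA_C": {"logFC": 0.9, "logCPM": 2.1, "PValue": 1e-8},   # fails the abundance cutoff
    "RNA_D": {"logFC": -0.4, "logCPM": 6.3, "PValue": 0.02},  # fails the P-value cutoff
}
up, down = split_by_direction(toy_results)
print(up, down)
```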
We performed an enrichment analysis of these differentially expressed gene sets against the Gene Ontology database to better understand their biological significance [13]. Upregulated TEP RNAs were enriched in biological processes such as regulation of protein-containing complex assembly, blood coagulation, and platelet activation. Conversely, downregulated RNAs were predominantly associated with processes such as ribonucleoprotein complex biogenesis, RNA splicing, and cytoplasmic translation (Supplementary Table 1).
However, only 120 RNAs were differentially expressed (log CPM ≥ 3, P < 0.05) between BC and the symptomatic control group with non-BC conditions. This may reflect the heterogeneity arising from the different types of benign breast tumors and the varying severity of the conditions.
3. Model development and validation for BC diagnostics
Subsequently, we constructed a diagnostic model using a support vector machine (SVM) algorithm. Figure 3 outlines the model development and optimization workflow. We included symptomatic controls in model development because benign breast diseases are prevalent among women: approximately 50% of those aged >30 years experience mastalgia and fibrocystic changes [14]. We first performed a differential analysis on the training dataset between the BC group (n = 138) and the disease-free group (n = 154) using the R package edgeR to obtain the input gene list for the classifier, which yielded 1761 genes (log CPM ≥ 3 and P < 0.001). We then narrowed this list down to 57 genes by LASSO modeling to minimize the number of markers in the final classifier. Single-factor logistic regression analysis of these 57 genes identified four genes with an area under the curve (AUC) of >0.8 (see Supplementary Table 2), which formed the 4-BC-TEP-RNA panel. This panel showed high diagnostic performance in the training cohort (AUC = 0.894, 95% confidence interval [CI]: 0.857–0.931, sensitivity: 94.2%, specificity: 74%, Figure 4a). In the validation cohort, the classifier distinguished BC (n = 55) from non-BC (n = 61) with high sensitivity (AUC = 0.861, 95% CI: 0.787–0.935, sensitivity: 96.4%, specificity: 73.8%, Figure 4a). The specificity was 100% for asymptomatic controls but only 15.79% for symptomatic controls, which lowered the model’s overall specificity (Figure 4b).
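A short worked calculation shows how the subgroup specificities combine into the overall validation specificity. The subgroup sizes used here (42 asymptomatic and 19 symptomatic controls) are not reported in the text; they are an assumption, inferred as a split of the 61 non-BC validation samples that reproduces the reported percentages (3/19 = 15.79%).

```python
def specificity(true_negatives, n_controls):
    """Fraction of non-BC samples correctly classified as non-BC."""
    return true_negatives / n_controls

# Assumed subgroup counts (inferred from the reported percentages, not stated in the text).
asymptomatic = {"n": 42, "correct": 42}  # 42/42 = 100%
symptomatic = {"n": 19, "correct": 3}    # 3/19 = 15.79%

pooled = specificity(asymptomatic["correct"] + symptomatic["correct"],
                     asymptomatic["n"] + symptomatic["n"])
print(f"symptomatic: {specificity(3, 19):.2%}, pooled: {pooled:.1%}")
```

Under these assumed counts, the pooled value is 45/61, which rounds to the reported 73.8%. Because the pooled specificity is a size-weighted average, a small subgroup with very low specificity is enough to pull the overall figure down noticeably.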
To improve the overall classifier performance, we extracted the top seven genes among the RNAs that differed between BC and the symptomatic control group (single-gene AUC > 0.65, see Supplementary Table 2). These genes were added to complement the 4-BC-TEP-RNA panel. The diagnostic model built on the optimized 10-BC-TEP-RNA panel achieved an AUC, sensitivity, and specificity of 0.97 (95% CI: 0.954–0.986), 88.4%, and 92.9% in the training cohort and 0.941 (95% CI: 0.91–0.979), 87.3%, and 88.3% in the validation cohort, respectively (Figure 4c, 4d). The AUC of this classifier was significantly higher than that of the 4-BC-TEP-RNA panel in both the training (P = 2.685e-06) and validation cohorts (P = 6.139e-04).
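Ranking genes by single-gene AUC can be sketched as follows. This is an illustration, not the authors' code: it uses the rank-based (Mann-Whitney) estimate of the AUC on raw expression values instead of fitting a logistic regression. For a single predictor the fitted logistic score is a monotone function of expression, so the two approaches typically give the same AUC once the direction of the marker is accounted for. The gene names and values are hypothetical.

```python
def auc(case_values, control_values):
    """Mann-Whitney estimate of the AUC: P(case > control) + 0.5 * P(tie)."""
    wins = 0.0
    for c in case_values:
        for k in control_values:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(case_values) * len(control_values))

def select_genes(expr_cases, expr_controls, min_auc):
    """Return genes whose direction-agnostic single-gene AUC exceeds min_auc, best first."""
    selected = {}
    for gene in expr_cases:
        a = auc(expr_cases[gene], expr_controls[gene])
        a = max(a, 1.0 - a)  # a downregulated marker is as informative as an upregulated one
        if a > min_auc:
            selected[gene] = a
    return sorted(selected, key=selected.get, reverse=True)

# Toy expression values (hypothetical genes): cases vs. symptomatic controls.
cases = {"G1": [5, 6, 7, 8], "G2": [1, 2, 3, 4], "G3": [3, 5, 4, 6]}
controls = {"G1": [1, 2, 3, 6], "G2": [5, 6, 7, 8], "G3": [4, 5, 3, 6]}
print(select_genes(cases, controls, min_auc=0.65))
```

In this toy example, `G3` has identical distributions in both groups (AUC 0.5) and is dropped, while `G2` is kept even though it is lower in cases, because the AUC is taken in whichever direction separates the groups.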
4. Test of the TEP-derived RNA panel model in an independent cohort
We evaluated the two TEP-derived RNA panels in the independent test cohort, composed of 83 BC and 71 non-BC samples. The 10-BC-TEP-RNA panel yielded an AUC of 0.957 (95% CI: 0.931–0.983), a significantly better diagnostic performance than that of the 4-BC-TEP-RNA panel (P = 1.578e-05) (Figure 5a). The expanded panel was intended to refine the distinction between BC and benign breast lesions by correcting samples initially classified as positive by the 4-BC-TEP-RNA panel. Furthermore, the specificity was 100% in the asymptomatic control group (n = 65) and 62.1% in the symptomatic control group (n = 29) (Figure 5b). The detection accuracy for the different tumor stages ranged from 80% to 94.6% (80% [n = 5], 83.3% [n = 12], 94.6% [n = 37], 88.9% [n = 9], and 83.3% [n = 18] for stages 0, I, II, III, and IV, respectively) and was 100% (n = 2) for samples of unknown stage (Figure 5c). Table 2 shows other detailed parameters.
Of note, the complete cohort included two patients who underwent mastectomy and provided blood samples approximately one week after breast tumor resection. One patient, initially diagnosed with invasive breast carcinoma, showed no evidence of residual cancer on a pathologic examination conducted a week later. For this patient, the 10-BC-TEP-RNA panel assigned a BC probability of 0.216 (<0.5). Conversely, the second patient, diagnosed with moderate-grade ductal carcinoma in situ on pathologic assessment, received a classifier-assigned BC probability of 0.858 (>0.5). Remarkably, the classifier’s results agreed with the pathologic findings in both instances.
5. Development of classifiers for various receptor subtypes in BC
The appropriate therapeutic approach for each patient with BC depends on tumor subtype, cancer stage, and patient preferences. For instance, the strategy for patients without metastases is determined by tumor receptor subtype: patients with hormone receptor (HR)-positive tumors are recommended endocrine treatment; patients with human epidermal growth factor receptor 2 (HER2)-amplified tumors receive targeted therapy or a combination of small-molecule inhibitors and chemotherapy; HR-positive/HER2-amplified BC is effectively treated with combined HER2/ER blockade; and patients with triple-negative tumors are eligible only for chemotherapy [15, 16].
Hence, we evaluated whether TEP-RNA profiles could distinguish between receptor subtypes. Despite the limited sample size and the influence of potential confounding factors, our TEP-RNA profile-based panels successfully detected the HER2-amplified and HR-positive subtypes of BC (P < 0.01 compared with a random classifier, Figure 6).