Obtaining TNBC datasets and preprocessing
We downloaded the clinical information of 360 TNBC patients and the RNA-seq raw data, corresponding to each primary tumor samples, from the sequence read archive database (SRA).The clinical information includes age, relapse-free survival (RFS), and clinical stage.
Construction of immune cell expression matrix
To assess the relative proportion of the 22 types of infiltrating immune cells, we used CIBERSORT (http://cibersort.stan-ford.edu/), a deconvolution-based online tool [9]. This tool enables the use of gene expression data to estimate the cellular composition of complex tissues and to quantify the abundance of immune cells and other specific cell types. In addition, we also used the LM22 characteristic matrix to calculate (after removing the low-abundance immune cells with a median ratio of less than 1%) and perform subsequent analysis only on the remaining immune cells.
K-means clustering for TME-infiltrating cells
K-means clustering is often used in datasets where the category attributes are unclear, and it is desirable to be able to mine or automatically classify objects with similar characteristics through data mining [10]. In this study, to identify the heterogeneity of different individual patients, the patient's tissue subpopulation was initially analyzed by K-means clustering, which was done using the R package factoextra.
Pathway enrichment analysis
To understand the cause of the phenotypic differences from a bioinformatics perspective, we used the gene set variation analysis (GSVA) and gene-set enrichment analysis (GSEA) [11, 12]. GSVA is a method of enrichment of gene sets that estimates changes in pathway activity in different samples in an unsupervised manner. Here, we estimated the difference in pathways between the cluster A and cluster B of TNBC. First, we obtained the reference gene set from the Broad Institute's MSigDB database (software.broadinstitute.org/gsea/msigdb/), and then used the limma package to set the difference standard as P < 0.05, the false discovery rate (FDR) was greater than 0.2 to screen the differential genes [13]. Next, the GSVA package was used to carry out the differential pathway enrichment. Finally, GSEA analysis by cluster Profiler R package was used to explore whether the aberrant pathway activations are also consistent with GSVA results.
Statistical analysis
The immune infiltration ratio was calculated using the online tool CIBERSORT. Patients with a P value of < 0.05 were used for subsequent analysis. The Kaplan-Meier survival curves of the TNBC subtype and recurrence-free survival data (RFS) were plotted using the "survival" function, and the differences between the two subtypes of RFS were evaluated using log-rank test analysis. The Wilcoxon rank-sum test was used to detect the expression of different clusters in the 22 immune cells. The GSVA package was used to annotate the possible pathway differences between different clusters. All data were statistically analyzed using R (version 3.6.1) software. P < 0.05 was considered statistically significant, and all the statistical tests performed were two-sided.