3.1 | Pre-processing of CTCs and DEA
There are 74 out of 77 cells after pre-processing for downstream analyses. The excluded cells had low quality; therefore, we omited them. The differential expression analysis was implemented after the normalization step. The expression difference between cluster CTCs and single CTCs groups were assessed (FDR < 0.05). The genes of immune subnetwork are downregulated (light purple color in the heatmap) in cluster CTCs compare to single CTCs. Furthermore, the genes of the EMT subnetworks imply upregulation (dark purple color in the heatmap) in cluster CTCs (Fig. 1). The immune and EMT subnetworks were used independently to cluster all cells. As illustrated in Fig. 1, the single and cluster CTC cells separated well with selected subnetworks. The immune-related subnetwork represents a stronger expression difference between cluster cells and single cells.
3.2 | Metastasis associated subnetworks
Metastasis associated subnetworks were determined by co-expression analysis and hierarchical clustering. We detected 16 subnetworks. The first principle component (in PCA analysis) of subnetworks and the trait (cluster CTCs vs. single CTCs) relationship assessed by correlation analysis. The two top subnetworks with the highest correlation with the trait, nominated for Enrichment (midnightblue correlation =0.57, turquoise correlation = 0.51). The midnightblue and turquoise subnetworks sizes are 35 and 22 genes, respectively (Fig. S1). The midnightblue subnetwork enriched for immune responses and the turquoise subnetwork enriched for EMT (Fig. 2).
The EMT subnetwork enriched for cancer related pathways such as cell-cell communication, tight junction, keratinization, estrogen signaling pathway. The immune subnetwork enriched for pathways such as platete activation, immune system, innate immune system.
To have a biological concept for subnetworks, we address the midnightblue and the turquoise subnetworks, the immune and the EMT subnetworks, respectively. The immune-related novel genes are PTCRA, F13A1, LAT, GNG11, ICAM2, NRGN, P2RX1, CLEC1B, BIN2, LPAR5, CCL5, SELP, RUFY1, C6orf25, TUBB1, GFI1B, C2orf88, ACRBP, andC17orf72. The list of genes identified in the immune-related subnetwork is Table S1.
The EMT related genes are LRPPRC, AGR2, CLDN4, CRIP1, DSP, ELF3, JUP, KRT8, KRT18, KRT19, FAM102A, TACSTD2, EPCAM, PEBP1, PSMD8, RAN, SNRPC, SPTAN1, EZR, DDR1, MLPH, and WDR34. In which, SNRPC is a metastatic novel gene in breast cancer that is upregulated in CTC clusters. The list of genes identified in EMT-related subnetwork is summarized in Table S2.
The preservation of all subnetworks was assessed in another dataset (GSE51827). The two combined statistics and calculated to assess subnetworks preservation in the second data set (immune subnetwork: , and EMT subnetwork: , ). The values for both subnetworks are more than 10, and values are high too (min= 7, max= 32). Subsequently, these statistics represent our immune and EMT subnetworks preservation in the second data set (Table S3).
3.2 | Directed network reconstruction
The signaling pathways inside the CTCs are not well recognized. The 336 homo sapiens signaling pathways out of 537 ones in the KEGG database downloaded and merged. The association among two selected subnetworks (immune and EMT) were investigated by extracting induced subnetwork from KEGG. A directed subnetwork of size 255 genes extracted and illustrated in Fig. 3. There are 12 gene categories obtained from biological processes, including Hormonal regulation, Immune responses, Ion metabolism, Nucleobase metabolism, Oxidative responses, Protein localization, Protein topology response, STAT singling pathway, Vitamin metabolism, cell differentiation, circulation in blood regulation, Energy metabolism. These categories are illustrated by colors on network nodes, and the genes with no category remained grey. The nodes with multiple colors were detected in different biological processes. The node size is illustrated by the node degrees. PLCG1 and ENTPD8 are two hub nodes in the network that participate in energy and nucleobase metabolisms, respectively.
3.3 | Distant metastasis classification model
The distant metastasis potential of two nominated subnetworks for classifying patients into primary tumors was assessed using SVM, neural network, and decision tree methods. The 5-fold cross-validation SVM accuracy, sensitivity, and specificity scores for EMT related subnetwork are 79%, 78%, and 21%, respectively. The neural network accuracy, sensitivity, and specificity scores are 18%, 18%, and 80%, respectively. Eventually, the decision tree accuracy, sensitivity, and specificity scores are 60%, 60%, and 30%, respectively. These results refer to a full model (all genes in the subnetwork included). Compare to these three models, the SVM model is the strongest method in classifying metastatic and non-metastatic patients, but the specificity score is too low. The SVM model validated in GSE9195.
The SVM accuracy, sensitivity, and specificity scores for immune-related subnetwork are 78%, 78%, and 78%, respectively. The neural network accuracy, sensitivity, and specificity scores are 85%, 85%, and 14%, respectively. Eventually, the decision tree accuracy, sensitivity, and specificity scores are 71%, 71%, and 36%, respectively. These results refer to a full model (all subnetwork genes included). The specificity of the neural network and decision tree method is low compare to the SVM. Due to results, the SVM model is the most powerful method in classifying metastatic and non-metastatic patients for the immune-related subnetwork. The immune-related subnetwork accuracy, sensitivity, and specificity for the SVM model are superior to EMT related subnetwork.
The feature selection methods were implemented. The accuracy, sensitivity, and specificity of feature selection models (GA and WCC) did not improve. The feature selection algorithms implemented in the SVM model for both subnetworks. The WCC introduced 13 genes, and GA introduced 12 genes for EMT related subnetwork. The WCC algorithm introduced 15 genes, including HLA-E, MYLK, WIPF1, TLN1, F13A1, NRGN, ICAM2, PTGS1, SELP, PF4, ITGA2B, GFI1B, TUBB1, PTCRA, RUFY1, BIN2, and CLEC1B for EMT subnetwork. The GA introduced 17 genes for immune-related subnetworks, including CCL5, MYLK, WIPF1, TLN1, NRGN, GNG11, PTGS1, SELP, ITGA2B, MAX, GFI1B, P2RX1, PTCRA, RUFY1, and BIN2. The SVM model (full model) for immune subnetwork validated in GSE9195. The accuracy, sensitivity, and specificity are 0.868, which is superior to GSE7390. The results confirm that the immune-related genes detected in this study can classify metastatic and non-metastatic samples more precisely compared to the neural network and decision tree models in two data sets. We implemented the classification methods to assess the metastasis potential of two nominated subnetworks which extracted from CTCs’ data. We fitted the Cox-PH as a predictive model to discriminate patients to low, medium, and high metastasis risk groups.
3.4 | Distant metastasis and overall survival analysis
The association between gene expression and distant metastasis-free survival /overall survival was implemented to detect metastasis potential genes in selected subnetworks. The JUP, KRT18, and KRT19 overall survival and distant metastasis-free survival are significant by log-rank test (p-value < 0.05) (Fig. 4a, b, c, d, e, and f). These three genes belong to EMT subnetwork. The high expression level of JUP, KRT18, and KRT19 associate with low overall survival and further lower metastasis progression. The distant metastasis-free survival and overall survival curves indicate the same pattern.
We fitted a metastasis free-survival Cox-PH regression model for EMT and Immune subnetworks to assess metastasis progression and patients’ predictive models. The Immune Cox-PH model includes of RUFY1 and P2RX1 variables (Likelihood ratio test p-value = 0.0295). The EMT Cox-PH model includes of RAN, PEBP1, KRT8, DSP, DDR1, and CLDN4 variables (Likelihood ratio test p-value = 0.0001016). The variables’ coefficients and p-values reported in Table S4 and S5. All the significant genes in the model have VIF < 2 to avoid multicollinearity (Table S6 and S7). The proportional hazard assumption for two model variables was assessed by the Schoenfeld residuals (Fig. S2 and S3). The predictive Cox-PH models for distant metastasis-free survival for two subnetworks are illustrated in Fig. 4g and h. The concordance index, as a model evaluation measure, for EMT and Immune predictive Cox-PH models are 0.7 and 0.6, respectively. Therefore, the EMT model is more powerful in discriminating patients into low, medium, and high metastasis risk groups compare to the Immune model (higher concordance index indicates more power in discrimination).