RNAs were differentially expressed in LNM and non-LNM samples
A total of 438 COAD samples with RNA-seq, miRNA-seq and clinical data were included, of which 255 were N0 and 183 were N1/N2 samples. In total, 352 RNAs were differentially expressed between LNM and non-LNM samples, including 263 unique DE-mRNAs, 60 DE-lncRNAs, 10 DE-miRNAs and 19 DE-undefined genes (Figure 1A and Table 1).
For READ, a total of 160 samples with RNA-seq, miRNA-seq and clinical data were included, of which 80 were N0 samples and 76 were N1/N2 samples. A total of 474 RNAs were differentially expressed in LNM vs. non-LNM samples, including 244 unique DE-mRNAs, 87 DE-lncRNAs, 26 DE-miRNAs and 117 DE-undefined genes (Figure 1B and Table 1).
RNAs were differentially expressed in distant metastases and non-distant metastases samples
Among the 438 COAD samples, there were 316 M0 and 63 M1 samples. In all, 129 RNAs were differentially expressed in M0 and M1 samples, including 81 unique DE-mRNAs, 22 DE-lncRNAs, 5 DE-miRNAs, and 21 DE-undefined genes (Figure 1C and Table 1).
Among the 160 READ samples, there were 121 M0 and 22 M1 samples. A total of 134 RNAs were differentially expressed in the M0 and M1 samples, including 90 unique DE-mRNAs, 34 DE-lncRNAs, and 10 DE-undefined genes (Figure 1D and Table 1).
Identifying the optimal model to predict LNM in COAD and READ samples
For LNM prediction in COAD we evaluated the AUC for each of the six machine learning models. They were as follows: 0.7671 for the LR model, 0.7722 for the NN model, 0.7532 for the SVM model, 0.7610 for the RF model, 0.7634 for the GBDT model, and 0.8040 for the Catboost model (Figure 2A). Therefore, Catboost was identified as the optimal model for the prediction of LNM in COAD samples. There were 236 feature genes identified by the Catboost model (Supplemental Table 1), of which C6orf15 and CXCL11 were shown to make the largest contribution (Supplemental Figure 2).
For LNM prediction in READ samples, the LR model showed the largest AUC at 0.9254 and thus was identified as the optimal model (Figure 2B). A total of 292 feature genes were identified by this model (Supplemental Table 2), and the significance of each of these genes is shown in Supplemental Figure 3.
Identification of the optimal model to predict distant metastases in COAD and READ samples
For the prediction of distant metastases in COAD, the AUC for all six models were evaluated. The values were as follows: 0.6914 for the LR model, 0.8047 for the NN model, 0.6432 for the SVM model, 0.6992 for the RF model, 0.6406 for the GBDT model, and 0.6953 for the Catboost model (Figure 2C). The NN model was thus identified as the optimal model to predict distant metastases in COAD. There were 129 feature genes identified by the NN model.
For the prediction of distant metastases in READ samples, the NN model was also identified as offering the best AUC value (0.8600) making it the optimal model (Figure 2D). A total of 134 feature genes were identified by the NN model.
Biological functions of the feature mRNAs identified by the optimal models
The feature mRNAs identified by the optimal models are listed in Table 2. In order to investigate the biological functions of these mRNAs, GO annotation and KEGG pathway enrichment was analyzed. For the feature mRNAs distinguished by the Catboost model for LNM prediction in COAD, we observed a significant enrichment in calcium ion homeostasis, calcium ion transport into the cytosol, the calcium signaling pathway amongst others (Figure 3A).
The mRNAs identified by the LR model for LNM prediction in READ were implicated in retinoid/diterpenoid/ metabolic process, cholesterol metabolism, and remodeling related biological processes, including, protein−containing complex remodeling, protein−lipid complex remodeling, plasma lipoprotein particle remodeling (Figure 3B).
The mRNAs identified by the NN model for distant metastases prediction in COAD were found to participate in terms that included regulation of transmembrane transport/ion transport, T cell chemotaxis, chemokine signaling pathway amongst others (Figure 4A). For the mRNAs identified by the NN model for distant metastases prediction in READ, the enriched results mainly contained killing of cells of other organisms, disruption of cells of other organisms, triglyceride−rich lipoprotein particle remodeling, and cholesterol/thiamine metabolism (Figure 4B).
PPI network from the feature mRNAs identified by the optimal models
PPI analysis was performed on the feature mRNAs identified by the optimal models, and the mRNAs were resolved using the STRING database were listed in Supplemental Table 3. Isthmin 2 (ISM2) and killer cell immunoglobulin like receptor, two Ig domains and long cytoplasmic tail 4 (KIR2DL4) were the only overlapping mRNAs between all four groups. Figure 5A shows the PPI network of the mRNAs identified in the Catboost prediction model for LNM in READ. Among the mRNAs in figure 5A, C6orf15, an uncharacterized protein located on chromosome 6 open reading frame 15, had the greatest log FC value (positively correlated with node size) in the differential expression analyses. While GNG4, G protein subunit gamma 4, interacted with the most proteins. The PPI network of the mRNAs identified by the LR prediction model for LNM in READ is shown in Figure 5B and identifies KIR2DL4 as the most significant component.
The PPI network of the mRNAs identified by the NN model for distant metastases prediction in COAD is shown in Figure 6A, and was shown to include several chemokine genes, including C-X-C motif chemokine ligand 9 (CXCL9), CXCL10, CXCL11, CXCL13 and C-C motif chemokine ligand 25 (CCL25). The PPI network of the mRNAs identified by the NN model for distant metastases prediction in READ is shown in Figure 6B. Regenerating family member 3 alpha (REG3A), defensin alpha 5 (DEFA5), and DEFA6 were shown to have the greatest log FC (positively correlated with node size) values in the differential expression analysis.
Drug-target pairs for feature mRNAs
For LNM prediction in COAD, 264 drug-gene pairs were predicted to target 29 feature mRNAs (Figure 7A and Supplemental table 4). Gamma-aminobutyric acid type A receptor alpha3 subunit (GABRA3), cholinergic receptor muscarinic 2 (CHRM2), Fc fragment of IgG receptor IIIb (FCGR3B), and dopamine receptor D2 (DRD2) were targeted by the most drugs. A total of 9 drugs were found to target CXCL10. From the LNM prediction in READ, 113 drug-gene pairs were predicted to target 12 feature mRNAs (Figure 7B and Supplemental table 5). GABRB2 was targeted by 11 drugs, and all 11 drugs were potentiators of GABRB2. Cyclosporine and deferoxamine were found to target CXCL2.
For the distant metastases prediction in COAD, 83 drug-gene pairs were predicted to target 17 feature mRNAs (Figure 8A and Supplemental table 6). For the distant metastases prediction in READ, 119 drug-gene pairs were predicted to target 17 feature mRNAs (Figure 8B and Supplemental table 7). A total of 9 drugs were found to target CXCL10. GABRA3, 5-hydroxytryptamine receptor 1D (HTR1D), HTR2C, and solute carrier family 6 member 4 (SLC6A4) were targeted by the most drugs, and the interaction types were all known.