Machine learning models for predicting lymph node and distant metastases in colorectal cancer

Background: Colorectal cancer (CRC) is the third most common malignancy in the world and metastasis is responsible for a major proportion of the cancer-related deaths in CRC patients. Aims: To construct machine learning models for predicting lymph node and distant metastases in colorectal cancer and analyze biological functions features of metastasis-related genes. Methods: RNA-seq and miRNA-seq data as well as corresponding clinical data from colon adenocarcinoma (COAD) and rectum adenocarcinoma (READ) were obtained from The Cancer Genome Atlas (TCGA) database. The differentially expressed RNAs (DE-RNAs) in non-LNM (N0) and LNM (N1/N2) as well as non-distant metastases (M0) and distant metastases (M1) were analyzed. Six machine learning models including logistic regression (LR), random forest (RF), support vector machine (SVM), Catboost, gradient boosting decision tree (GBDT), and articial neural network (NN) were constructed to predict cancer metastasis and the feature genes of the optimal model were further analyzed by functional enrichment, protein-protein interaction (PPI) network, and drug-target analyses. Results: Differential RNA expression proles of LNM and non-LNM as well as M0 vs. M1 were observed in both COAD and READ samples. NN model was determined to be the optimal model for predicting distant metastases, while Catboost and LR models were the optimal models for predicting LNM in COAD and READ samples, respectively. PPI analysis indicated that KIR2DL4, chemokine-related genes CXCL9/10/11/13 and CCL25, and gamma-aminobutyric acid (GABA) receptor genes (GABRR1, GABRB2 and GABRA3) were key genes in metastasis. In addition, atorvastatin and eszopiclone were identied as potential therapeutic agents as they target these genes. Conclusions: We constructed six machine learning models for predicting colorectal cancer metastases and identify the optimal model. We analyzed biological functions features of metastasis-related RNAs in colorectal cancer.


Introduction
Colorectal cancer (CRC) is the third most common malignancy in the world. In the United States, clinicians observed a decreased CRC incidence rate in adults older than 50 years of age between 2000 and 2014, especially in distal tumors of individuals aged ≥ 65 years, while an increased incidence rate was observed in people less than 50 years of age during the same period [1]. The incidence and mortality of CRC vary between human development levels, with both increasing rapidly in low/middle-income countries while highly developed countries seem to be experiencing a decrease in these parameters. It is estimated that the global burden of CRC will increase by 60% by the year 2030 with over 2.2 million new cases and 1.1 million deaths per year [2]. Tumor cells from primary tumors can migrate to regional lymph nodes and distant organs. Metastasis is responsible for most cancer-related deaths in patients with CRC [3]. Although most local CRC patients can be cured using surgical resection, only around 70% of CRC patients with regional lymph node metastasis (LNM) can be cured by surgery coupled with adjuvant chemotherapy, and advanced metastatic CRCs are still mostly incurable, in spite of the improvements in medical treatment [4,5]. Thus, the accurate prediction of metastasis in CRC is crucial for its treatment.
Machine learning has been shown to be extremely accurate and precise when used to predict medical outcomes outstripping standard statistics and human judgment [6,7]. A previous review showed that machine learning models displayed excellent performance in predicting the outcomes of various neurosurgical conditions, outstripping established prognostic indicators and clinical experts [8]. Andrés et al. reported that machine learning showed increased sensitivity and speci city when predicting nodal metastasis compared to the tumor depth invasion model developed using early oral squamous cell carcinoma, and the forest algorithm was shown to be the most e cient model for this data [9]. Zhi et al. revealed that the genes identi ed by the support vector machine (SVM) classi er algorithm could accurately distinguish metastatic and non-metastatic CRC samples [10]. Arti cial neural network (NN) can be used to realize effective analysis of non-linear datasets and have been successfully used to assist clinical decision-making in neurosurgery [11]. In addition, it has been reported that NN is better at predicting relapse in breast cancer than the logistic regression (LR) model [12]. However, there are no studies on the applications of different machine learning models to CRC metastasis.
In this study, RNA expression data and clinical phenotype data from The Cancer Genome Atlas (TCGA) database, was analyzed using six machine learning models, including LR, random forest (RF), SVM, Catboost, gradient boosting decision tree (GBDT), and NN to predict CRC metastasis. Among the six models, the genes identi ed by the optimal models were further analyzed using functional enrichment analysis, protein-protein interaction (PPI) network and drug-target analysis (Supplemental gure 1 showing the work ow). This study could provide valuable insight for the treatment of CRC by providing biomarkers for the prediction of CRC metastasis.

Data acquisition and preprocessing
The sequencing data from RNA (RNA-seq) and microRNA (miRNA-seq) experiments as well as the corresponding clinical data of TCGA-colon adenocarcinoma (TCGA-COAD) and TCGA-rectum adenocarcinoma (TCGA-READ) were downloaded from the university of California Santa Cruz (UCSC, http://xena.ucsc.edu/) Genome Browser database. Based on the annotation le (gencode.v22.annotation.gene) available from the GENCODE database, the RNA-seq and miRNA-seq data were annotated to identify mRNA, lncRNAs, snoRNAs, and miRNA transcripts.

Differential expression analysis
For both COAD and READ data sets, the differential expression analysis of non-LNM (N0) and LNM (N1/N2), and non-distant (M0) and distant metastases (M1) were performed using the limma package. The differentially expressed mRNA (DE-mRNA), lncRNAs (DE-lncRNAs), snoRNAs (DE-snoRNAs), and miRNAs (DE-miRNAs) with a P value < 0.05 and |log fold change (FC)| > 0.585 or 1 (the number of feature genes used in the construction of the data models must be less than the number of samples, therefore the thresholds were different when identifying the candidate feature genes) were selected. The ggpubr package was used to visualize the volcano plot.
Training of the optimal model and performance evaluation The count value of the candidate feature genes was standardized to log2 (x+1) data and binary labels were added to each sample: metastases was 1 while non-metastases was 0. After that, samples were divided into a training (80% samples) and a test (20% samples) dataset using the train_test_split method from the scikit learn package in python. The sklearn.linear_model, sklearn.ensemble, sklearn.svm and sklearn.neural_network methods were used to construct LR, RF, GBDT, SVM and NN machine learning models, and the Catboost package was used to construct the Catboost machine learning model. The expression value of the feature genes in the samples were used as the feature value to classify and discriminate the samples. The Recursive Feature Elimination (RFE) algorithm was implemented in the sklearn.feature_selection method. After cross-validation, the optimal feature genes from each model were identi ed based on the area under the receiver operating characteristic (ROC) curve (AUC). The LNM and distant metastases in COAD and READ were predicted using the six models, and the model with the highest AUC was selected as the optimal model. The formula or code for all six models used in this study are shown in Supplemental les 1-4. The feature genes of the optimal model were used in the following analyses.

Functional enrichment analysis
To investigate the biological function of the feature genes, the Gene Ontology (GO) annotation terms and KEGG pathways were enriched using the clusterPro ler tool with a cut-off value of P < 0.05 and count ≥ 2.

Construction of the PPI network
The interactions between the feature genes were retrieved using the STRING database (Version: 11.0, http://www.string-db.org/) with the parameters set as follows: (1) 0.4 (medium con dence) PPI score; (2) species: Homo sapiens; and (3) disable structure previews inside network bubbles, hide disconnected nodes in the network. The PPI network was visualized using Cytoscape software.

Prediction of drug-target pairs
The drug-target pairs for the feature genes were predicted using the DGIdb 3.0 database. The drug-target pairs with FDA approval or those reported in published studies were selected and visualized using Cytoscape software.

Results
RNAs were differentially expressed in LNM and non-LNM samples A total of 438 COAD samples with RNA-seq, miRNA-seq and clinical data were included, of which 255 were N0 and 183 were N1/N2 samples. In total, 352 RNAs were differentially expressed between LNM and non-LNM samples, including 263 unique DE-mRNAs, 60 DE-lncRNAs, 10 DE-miRNAs and 19 DE-unde ned genes ( Figure 1A and Table 1).
For READ, a total of 160 samples with RNA-seq, miRNA-seq and clinical data were included, of which 80 were N0 samples and 76 were N1/N2 samples. A total of 474 RNAs were differentially expressed in LNM vs. non-LNM samples, including 244 unique DE-mRNAs, 87 DE-lncRNAs, 26 DE-miRNAs and 117 DEunde ned genes ( Figure 1B and Table 1).

RNAs were differentially expressed in distant metastases and non-distant metastases samples
Among the 438 COAD samples, there were 316 M0 and 63 M1 samples. In all, 129 RNAs were differentially expressed in M0 and M1 samples, including 81 unique DE-mRNAs, 22 DE-lncRNAs, 5 DE-miRNAs, and 21 DE-unde ned genes ( Figure 1C and Table 1).
Among the 160 READ samples, there were 121 M0 and 22 M1 samples. A total of 134 RNAs were differentially expressed in the M0 and M1 samples, including 90 unique DE-mRNAs, 34 DE-lncRNAs, and 10 DE-unde ned genes ( Figure 1D and Table 1).

Identifying the optimal model to predict LNM in COAD and READ samples
For LNM prediction in COAD we evaluated the AUC for each of the six machine learning models. They were as follows: 0.7671 for the LR model, 0.7722 for the NN model, 0.7532 for the SVM model, 0.7610 for the RF model, 0.7634 for the GBDT model, and 0.8040 for the Catboost model ( Figure 2A). Therefore, Catboost was identi ed as the optimal model for the prediction of LNM in COAD samples. There were 236 feature genes identi ed by the Catboost model (Supplemental Supplemental material 1), of which C6orf15 and CXCL11 were shown to make the largest contribution (Supplemental Figure 2).
For LNM prediction in READ samples, the LR model showed the largest AUC at 0.9254 and thus was identi ed as the optimal model ( Figure 2B). A total of 292 feature genes were identi ed by this model (Supplemental material), and the signi cance of each of these genes is shown in Supplemental Figure 3.
Identi cation of the optimal model to predict distant metastases in COAD and READ samples For the prediction of distant metastases in COAD, the AUC for all six models were evaluated. The values were as follows: 0.6914 for the LR model, 0.8047 for the NN model, 0.6432 for the SVM model, 0.6992 for the RF model, 0.6406 for the GBDT model, and 0.6953 for the Catboost model ( Figure 2C). The NN model was thus identi ed as the optimal model to predict distant metastases in COAD. There were 129 feature genes identi ed by the NN model.
For the prediction of distant metastases in READ samples, the NN model was also identi ed as offering the best AUC value (0.8600) making it the optimal model ( Figure 2D). A total of 134 feature genes were identi ed by the NN model.
Biological functions of the feature mRNAs identi ed by the optimal models The feature mRNAs identi ed by the optimal models are listed in Table 2. In order to investigate the biological functions of these mRNAs, GO annotation and KEGG pathway enrichment was analyzed. For the feature mRNAs distinguished by the Catboost model for LNM prediction in COAD, we observed a signi cant enrichment in calcium ion homeostasis, calcium ion transport into the cytosol, the calcium signaling pathway amongst others ( Figure 3A).
The mRNAs identi ed by the LR model for LNM prediction in READ were implicated in retinoid/diterpenoid/ metabolic process, cholesterol metabolism, and remodeling related biological processes, including, protein−containing complex remodeling, protein−lipid complex remodeling, plasma lipoprotein particle remodeling ( Figure 3B).
The mRNAs identi ed by the NN model for distant metastases prediction in COAD were found to participate in terms that included regulation of transmembrane transport/ion transport, T cell chemotaxis, chemokine signaling pathway amongst others ( Figure 4A). For the mRNAs identi ed by the NN model for distant metastases prediction in READ, the enriched results mainly contained killing of cells of other organisms, disruption of cells of other organisms, triglyceride−rich lipoprotein particle remodeling, and cholesterol/thiamine metabolism ( Figure 4B).
PPI network from the feature mRNAs identi ed by the optimal models PPI analysis was performed on the feature mRNAs identi ed by the optimal models, and the mRNAs were resolved using the STRING database were listed in Supplemental material. Isthmin 2 (ISM2) and killer cell immunoglobulin like receptor, two Ig domains and long cytoplasmic tail 4 (KIR2DL4) were the only overlapping mRNAs between all four groups. Figure 5A shows the PPI network of the mRNAs identi ed in the Catboost prediction model for LNM in READ. Among the mRNAs in gure 5A, C6orf15, an uncharacterized protein located on chromosome 6 open reading frame 15, had the greatest log FC value (positively correlated with node size) in the differential expression analyses. While GNG4, G protein subunit gamma 4, interacted with the most proteins. The PPI network of the mRNAs identi ed by the LR prediction model for LNM in READ is shown in Figure 5B and identi es KIR2DL4 as the most signi cant component.
The PPI network of the mRNAs identi ed by the NN model for distant metastases prediction in COAD is shown in Figure 6A, and was shown to include several chemokine genes, including C-X-C motif chemokine ligand 9 (CXCL9), CXCL10, CXCL11, CXCL13 and C-C motif chemokine ligand 25 (CCL25). The PPI network of the mRNAs identi ed by the NN model for distant metastases prediction in READ is shown in Figure 6B. Regenerating family member 3 alpha (REG3A), defensin alpha 5 (DEFA5), and DEFA6 were shown to have the greatest log FC (positively correlated with node size) values in the differential expression analysis.
Drug-target pairs for feature mRNAs For LNM prediction in COAD, 264 drug-gene pairs were predicted to target 29 feature mRNAs ( Figure 7A. Gamma-aminobutyric acid type A receptor alpha3 subunit (GABRA3), cholinergic receptor muscarinic 2 (CHRM2), Fc fragment of IgG receptor IIIb (FCGR3B), and dopamine receptor D2 (DRD2) were targeted by the most drugs. A total of 9 drugs were found to target CXCL10. From the LNM prediction in READ, 113 drug-gene pairs were predicted to target 12 feature mRNAs ( Figure 7B ). GABRB2 was targeted by 11 drugs, and all 11 drugs were potentiators of GABRB2. Cyclosporine and deferoxamine were found to target CXCL2.
For the distant metastases prediction in COAD, 83 drug-gene pairs were predicted to target 17 feature mRNAs ( Figure 8A). For the distant metastases prediction in READ, 119 drug-gene pairs were predicted to target 17 feature mRNAs ( Figure 8B). A total of 9 drugs were found to target CXCL10. GABRA3, 5hydroxytryptamine receptor 1D (HTR1D), HTR2C, and solute carrier family 6 member 4 (SLC6A4) were targeted by the most drugs, and the interaction types were all known.

Discussion
Machine learning has recently been shown to be an accurate and precise method for predicting medical outcomes outstripping standard statistics and human judgment. Recently, aberrant expression of genes, and several kinds of non-coding RNAs have been demonstrated to function in CRC metastasis and could serve as predictive biomarkers for this condition [13][14][15]. In this study, the NN model was identi ed as the optimal model for predicting distant metastases, while Catboost and LR models were shown to be optimal models for predicting LNM in COAD and READ samples, respectively. Consistent with these observations Biglarian et al. showed that the ROC for NN and LR models predicting distant metastasis of CRC were 0.82 and 0.77, respectively, suggesting that the NN model was more suitable for the prediction of distant metastasis in CRC [16]. The feature genes from the optimal models were signi cantly enriched in calcium ion homeostasis, transmembrane transport/ion transport, T cell chemotaxis, chemokine signaling pathway amongst others. While the PPI analysis indicated that KIR2DL4, chemokine-related genes (CXCL9/10/11/13 and CCL25), and gamma-aminobutyric acid (GABA) receptor genes ( GABRR1, GABRB2 and GABRA3) are all key genes in metastasis. In addition, atorvastatin and eszopiclone were shown to target those genes, suggesting they may have some therapeutic value. KIR2DL4, also known as CD158D, is a member of the killer cell immunoglobulin-like receptor (KIR) family expressed by natural killer (NK) cells [17]. KIRs can recognize histocompatibility complex (MHC) ligands and mediate the function of NK cells. KIR2DL4 functions as an inhibitory receptor that releases inhibitory signals to NK cells [17,18]. HLA-G, a non-classical class I human leukocyte antigen, is the only known ligand for KIR2DL4 [17]. Notably, expression of HLA-G has been reported to be a predisposing factor for metastasis [19]. Studies have revealed the associations between HLA-G expression and lymph node metastasis in cervical cancer [20], papillary thyroid cancer [21], and CRC-associated liver metastases [22]. Reportedly, HLA-G upregulates the expression of matrix metalloproteinases (MMPs) and other tumor promoting factors, so that tumor cells have a higher invasion and metastasis potential [23]. Ueshima et al. suggested that KIR2DL4+ tissue mast cells promote LNM and lymph-vascular invasion in HLA-G+ breast cancer cells [24]. Thus, we hypothesize that KIR2DL4 plays a crucial role in CRC metastases probably via interactions with its ligand HLA-G.
Studies have shown that chemokines play important roles in clinical outcome prediction and invasion/metastasis of various cancers [25,26]. For example, CXCL10 elevates the expression of MMP9, and triggers cell migration and invasion in metastatic CRC cells rather than primary cancer cells [27]. Tokunaga et al. revealed that the chemokine CXCL9/CXCL10/CXCL11/CXCR3 axis mediates differentiation and migration of immune cells via the paracrine axis and promotes cancer metastasis via the autocrine axis, suggesting that this axis is a potential target for anti-cancer therapies [28]. This is consistent with the results of our study where we found that aberrant expression of CXCL9/CXCL10/CXCL11/CXCL13/CCL25 were identi ed as features in the NN model used to predict distant metastases and were enriched in T cell chemotaxis and chemokine signaling pathways. We suggest that this implicates these chemokines in the regulation of CRC metastases. Notably, several drugs were found to target CXCL10, for example, atorvastatin. Atorvastatin has been reported to decrease plasma CXCL10 levels in the treatment of Crohn's disease [29], and decrease plasma CXCL9 levels in the treatment of systemic lupus erythematosus [30]. In addition, combined treatment with atorvastatin and phloretin induces apoptosis and G2/M arrest in colon cancer cells [31]. Thus we speculate that atorvastatin may be a promising drug for targeting chemokines during the treatment of CRC.
Gamma-aminobutyric acid (GABA) is an inhibitory neurotransmitter which interacts with two types of receptors, including GABA A and GABAB. GABAA receptors are ionotropic receptors consisting of various subunits and functioning as chloride channels [32,33]. GABA has been implicated in cancer metastasis [34,35]. GABRR1, GABRB2 and GABRA3 are all subunits of the GABAA receptor. Liu et al. suggested that GABRA3 could promote LNM in lung adenocarcinoma by inducing MMP-2 and MMP-9 expression [36]. In this study, GABRR1, GABRB2 and GABRA3 were all aberrantly expressed in CRC, and were among the feature genes identi ed by the optimal machine learning models for the prediction of CRC metastasis. They were enriched in ion transport and transporter activity related terms. Reportedly, ion channels/transporters are important operators of various cell-cell signaling pathways as they sense and respond to changes in the environment. Ion channels/transporters participate in each step of the cascade in cancer metastasis, suggesting that they could be potential therapeutic targets in cancer therapy and metastasis prevention [37][38][39]. In this study, various drugs were predicted to target GABRR1, GABRB2 and GABRA3. These included eszopiclone, which is an agonist/positive allosteric modulator of GABRA3. Studies have shown the interactions between eszopiclone and GABA receptors [40,41]. However, despite these novel ndings, this study did have some limitations. (1) We analyzed the expression of different kinds of RNAs in CRC, and the role of differentially expressed mRNAs in the prediction of CRC metastasis. These analyses should be extended to the effects of the differential expression of the various miRNAs, lncRNAs and snoRNAs identi ed in previous CRC metastasis studies. (2) The effect of the key mRNAs identi ed in the machine models should be validated by experimental and clinical data. (3) The predicted drug-gene interactions should be con rmed to provide insight into their application in CRC treatment and the prevention of CRC metastasis.

Conclusions
We constructed multiple machine learning models for colorectal cancer metastasis, identi ed the optimal model, and applied the optimal model to analyze the valuable RNAs related to colorectal cancer metastasis. The machine learning models could contribute to the prediction of LNM and distant metastases in CRC. This study might provide a novel routine for screening the promising targets for CRC treatment and the prevention of CRC metastases.

Declarations
Ethics approval and consent to participate Not Applicable.

Consent to publish
Not Applicable.

Availability of data and materials
The datasets generated during the current study are not publicly available but obtained from corresponding authors on reasonable request.

Competing Interests
The authors declare that no con icts of interest exist.

Authors' Contributions
All authors participated in the conception and design of the study. Han Shuwen and Yang Xi conceived the manuscript; Han Shuwen, Yang Xi and Zhuang Jing wrote the paper; Dai Xiaoyu and Dai Siqi analyzed the data; Zhuang Jing and Liu Jin drew gures; All authors read and approved the paper.   Figure 1 The volcano plot of the differentially expressed genes. Volcano plot of the differentially expressed genes (including mRNAs, lncRNAs, snoRNAs, miRNAs and unde ned genes) between N0 and. N1/N2 in COAD samples, N0 and. N1/N2 in READ samples, M0 and M1 in COAD samples, and M0 and M1 in READ samples. Red dots represent the up-regulated genes, and blue dots represent the down-regulated genes.

Figures
The dots with genes name were in the top 30 most differentially expressed genes.

Figure 2
Receiver operating characteristic (ROC) curve for each of the six prediction models. Receiver operating characteristic (ROC) curve analyses for the six lymph node metastasis prediction models in COAD (A) and READ (B) samples showing the prediction performance of each model; ROC curves for the six distant metastases prediction models in COAD (C) and READ (D) samples showing the prediction performance of each of these models.

Figure 3
Functional enrichment analysis for the genes identi ed in the optimal LNM prediction models. Bubble chart of GO terms and KEGG pathways enriched for the genes identi ed in the optimal lymph node metastasis prediction models for COAD (A) and READ (B) data showing the functions of those genes; bubble size represents the count of these enriched genes; the color (red to blue) represents decreasing P value (from high to low).

Figure 4
Functional enrichment analysis for the genes identi ed in the optimal distant metastases' prediction models. Bubble chart of GO terms and KEGG pathways enriched for the genes identi ed in the optimal distant metastases prediction models from COAD (A) and READ (B) data showing the functions of these genes; bubble size represents the count for each of these enriched genes; the color (red to blue) represents changes in the P value (high to low).

Figure 5
PPI network for the genes identi ed by the optimal LNM prediction models. PPI network for the genes identi ed by the optimal lymph node metastasis prediction models in COAD (A) and READ (B) demonstrating interactions among these genes; red nodes represent up-regulated genes; blue nodes represent down-regulated genes; lines represent interactions between two nodes. PPI network for the genes identi ed by the optimal distant metastases prediction models. PPI network for the genes identi ed by the optimal distant metastases prediction models in COAD (A) and READ (B) demonstrating the interactions between these genes; red nodes represent up-regulated genes; blue nodes represent down-regulated genes; lines represent interactions between two nodes. Drug-gene network for the genes identi ed by the optimal LNM prediction models. Drug-gene network for the genes identi ed by the optimal lymph node metastasis prediction models in COAD (A) and READ (B) showing the drugs that target these genes; red nodes represent genes; blue nodes represent drugs; lines represent interactions between gene and drug, and the phrase around the lines represent the interaction type.

Figure 8
Drug-gene network for the genes identi ed by the optimal distant metastases prediction models. Druggene network for the genes identi ed by the optimal distant metastases prediction models in COAD (A) and READ (B) showing the drugs that target these genes; red nodes represent genes; blue nodes represent drugs; lines represent interactions between gene and drug, and the phrase around the lines represent the interaction type.