A Novel Four-MicroRNA Signature for Prognosis and Diagnosis in Colorectal Cancer

Background and aims: Colorectal cancer (CRC) remains a global public health problem. We aimed to access prognostic microRNAs (miRNAs) associated with overall survival (OS) and evaluate their diagnostic value in CRC. Methods: Three serum miRNA expression proles from Gene Expression Omnibus (GEO) database and tissue miRNA expression prole of CRC from The Cancer Genome Atlas (TCGA) database were studied to get common differentially expressed miRNA (DEmiRNAs) in serums and tissues. 569 CRC patients with survival information from TCGA database were randomly divided into the discovery and validation cohort. Based on the TCGA discovery cohort, the univariate cox analysis, least absolute shrinkage and selection operator (LASSO) regression analysis and multivariate cox proportional hazards regression analysis were performed and a prognostic signature was constructed. The prognostic capacity of the signature was veried in TCGA validation cohort and the prognostic signature expression in tumors and serums was validated in an independent microarray dataset and our clinical set. We also evaluated the diagnostic power of the model in serum by calculating the area under the receiver operating characteristic (ROC) curve in a clinical set. Functional enrichment analyses were applied to analyze the potential roles of the signature in CRC. Results: A total of 154 common DEmiRNAs were identied. Based on the TCGA discovery cohort, a four-miRNA (miR-10b, miR-130a, miR-561, miR-4684) prognostic signature was constructed and a predictive nomogram model including the expression of above four miRNAs showed good performance for predicting the 3- and 5-year overall survival. The ndings were well veried in TCGA validation cohort. miR-10b, miR-130a, miR-561 and miR-4684 were signicantly upregulated both in CRC tumors and serums. Circulating miR-10b, miR-130a, miR-561 and miR-4684 levels were signicantly downregulated after surgical resection of tumor in CRC patients, suggesting these four circulating miRNAs in the serum were tumor derived. The four circulating miRNA combination might act as a novel diagnostic signature for CRC. Functional enrichment analyses suggested that above four with multiple integrated biomarkers seems to be better. In this study, 569 CRC patients with information of survival status and survival time from TCGA database were randomly divided into the discovery and validation cohort to construct and validate the prognostic model. Using multiple appropriate statistical methods, four miRNAs (miR-10b, miR-130a, miR-561, miR-4684) were identied as independent prognostic indicators in the TCGA discovery cohort. An integrated four-miRNA prognostic signature was established. According the expression of the above miRNA signature and regression coecients, risk scores of patients in the TCGA discovery cohort were calculated and patients were divided into two groups with low-risk and high-risk. Kaplan-Meier analysis indicated that the high-risk group had a worse OS than the low-risk group. ROC analysis and the areas under curve showed that the four-miRNA signature exhibited high predictive accuracy in the 3-year and 5-year OS prediction. The universal applicability of the four-miRNA signature was veried by the TCGA validation cohort.

Characterization of DEmiRNAs associated with OS CRC patients without survival information in TCGA database were excluded. According the computer-generated distribution sequence, 569 CRC patients were randomly divided into the TCGA discovery cohort (n = 285) and TCGA validation cohort (n = 284). Univariate Cox analysis was carried out in the TCGA discovery cohort to study the association between DEmiRNAs expression and OS of CRC. The miRNAs with a p-value < 0.05 were considered to be signi cantly related to OS and were selected for next analysis. The optimal miRNAs were identi ed by using least absolute shrinkage and selection operator (LASSO) regression algorithm.
Establishment of prognostic signature in the TCGA discovery cohort The independent prognostic miRNAs of OS were determined by performing the multivariate Cox proportional hazards regression analysis. According the expression of key miRNAs and regression coe cients, a random forest plot was established. We calculated the risk scores of each patient with "Predict" function of R language. The TCGA discovery cohort was strati ed into two groups with high-risk and low-risk based on the median risk score. A heatmap was plotted to characterize the expression of key miRNAs and a Kaplan-Meier analysis was used to assess the prognosis difference between the two groups with high-risk and low-risk.
Assessing the prognostic signature in the TCGA discovery cohort In TCGA discovery cohort, a nomogram model was constructed based on the expression of key miRNAs and nomogram scores for patients were calculated. The agreement between model prediction and actual outcome was evaluated by calibration curves. 3-and 5-year OS probability of patients in TCGA discovery cohort was predicted based on the nomogram score, and the prognostic capacity was assessed by the ROC curve.
Validation of the prognostic signature in the TCGA validation cohort As same as the TCGA discovery cohort, TCGA validation cohort was divided into two groups with high-risk and low-risk according the risk scores of each patient calculated with "Predict" function of R language. A heatmap was also plotted to characterize the expression of key miRNAs and a Kaplan-Meier analysis was also used to assess the prognosis difference between the two groups. The prognostic accuracy in TCGA validation cohort was also assessed by the time-dependent ROC curve.

Patients and clinical samples
To validate the miRNAs expression and identify the diagnostic value of prognostic model, a total of 52 CRC patients who undergo surgical treatment in the Shenzhen Second People's Hospital (Shenzhen, China) and 52 volunteer healthy controls were invited to participate in this study. All the CRC patients did not receive treatment before surgery. The clinicopathological information of the CRC patients was shown in Table 1. The Ethics Committee of the Tsinghua Shenzhen International Graduate School approved this study and we obtained the written informed consent from all participating individuals before using their tissue and blood specimens. The resected tumor tissue and non-cancerous samples (away from tumor > 5 cm) of CRC patients were stored in RNAlater™ stabilization solution immediately (Thermo Fisher Scienti c, Waltham, MA, USA) then stored at -80 °C for future experiment. Pre-and post-operation CRC peripheral blood as well as healthy peripheral blood were collected in vacuum blood tubes without clot activator, clot at 4 °C for 2 h, and rotate at 2,000 rpm at 4 °C for 10 minutes. The serums were transferred into RNase-free tubes then stored at -80 °C. N0: No regional lymph node metastasis; N1: Metastasis in 1-3 regional lymph nodes; N2: Metastasis in 4 or more regional lymph nodes. Quantitative real-time PCR for miRNA expression analyses We respectively used the RNAiso for Small RNA (Takara, Japan) and miRNeasy Serum/Plasma kit (Qiagen, Hilden, Germany) to extract total miRNA from the tissues and serums. miRNA was reverse-transcribed by using the Mir-X miRNA First-Strand Synthesis Kit (Takara). In order to identify the miRNA expression, quantitative real-time PCR (qRT-PCR) was applied on a 7500 PCR system (Thermo Fisher Scienti c) using the TB Green Advantage qPCR Premix (Takara). All the primers used for qRT-PCR were shown in Table 2. The applied PCR conditions were as follows: step 1, 5 minutes at 95 °C, step 2, 40 cycles of 10 seconds at 95 °C, 35 seconds at 60 °C. At the end of cycles, a melting curve was performed according to the pre-setted conditions on the 7500 PCR system. To calculate the relative expression levels of miRNAs in colorectal tumor tissues, snRNA U6 was the internal control and the 2 −△△Ct method was used. We used cel-miR-39 (Gene Pharma, Shanghai, China) as the external reference and added 1 ul cel-miR-39 at a concentration of 1umol/L to each serum sample. The relative expression of miRNAs in serums was calculate with the 2 −△Ct method. The diagnostic performance of miRNAs expression in distinguishing CRC patients from healthy controls was determined by performing ROC analysis. Sensitivity against (1-speci city) was plotted at each cutoff threshold, and the probability of identifying CRC patients from healthy subjects was calculated by the area under the curve (AUC) values.
Gene ontology (GO) and Protein-protein Interaction (PPI) enrichment analysis of the potential target genes of the miRNAs To explore the potential roles of the miRNAs in CRC, we used the mirDIP database to get the top 100 potential target genes with the highest score for each miRNA. The selected genes were used to identify the functional differences and the involved biological processes with the GO terms and pathway enrichment analysis in Metascape database (http://metascape.org/gp/index.html#/main/step1). Terms with enrichment factor > 1.5, p-value < 0.01 and minimum count of 3 were collected. According the membership similarities, these terms were grouped into clusters. PPI enrichment analysis was carried out with the following algorithm in Metascape database: BioGrid, OmniPath and InWeb_IM [22][23][24]. The network contained proteins subset that interact with at least one other member in the given list, and the Molecular Complex Detection (MCODE) network was gathered to identify densely connected network components. MCODE is a graph-based approach clustering algorithm, which can quickly detect dense connected regions from large-scale protein networks and measure the degree of association of proteins in the modules.
Statistical analysis SPSS20.0 and Graphpad prism8.0 software were used for statistical analyses. The independent prognostic miRNAs for CRC were identi ed by performing the univariate and multivariate Cox proportional hazards regression analyses, and the prognostic capacity was assessed by the time-dependent ROC curve. Using the Kaplan-Meier method and logrank test to calculate survival rate and con rm the comparisons. The student's t test was used to determine the miRNAs expression difference between two groups. ROC curve and calculated AUC were used to discriminate CRC patients from healthy subjects. The criterion for statistical signi cance was p < 0.05 (two-sided).

Results
Common DEmiRNAs in serum and tumor tissue of CRC patients Figure 1 showed the overview of study strategy. The dysregulated miRNA expression pattern of GSE113486, GSE124158, GSE106817 and TCGA CRC cohort was respectively visualized by volcano plot ( Fig. 2A-2D). Red and green dots respectively indicated the upregulated and downregulated miRNAs in CRC serums or tumor tissues. Venn analysis was used to identify the common upregulated and downregulated miRNAs in both serums and tumor tissues of CRC patients and a Venn diagram was generated. Finally, we con rmed a total of 154 upregulated DEmiRNAs (Fig. 2E) and 0 downregulated DEmiRNAs (Fig. 2F). The common DEmiRNAs were showed in Table 3.
Prognostic capacity of the integrated four-miRNA signature in the TCGA validation cohort Next, different patients cohort was used to assess the common prognostic capacity of the integrated four-miRNA signature. Using the above four miRNA signature, 284 patients in the TCGA validation cohort were also divided into two groups with high-risk (n = 142) and low-risk (n = 142). The distributions of risk scores and survival status were plotted (Fig. 5A-B). A risk heatmap was constructed based on the expression of four miRNAs in TCGA validation cohort (Fig. 5C). Kaplan-Meier analysis suggested that the high-risk group had a worse OS than the low-risk group (P = 0.0127) (Fig. 5D). The ROC analysis was performed and the AUC for 3-and 5-years survival were 0.625 and 0.723, respectively (Fig. 5E). These results showed a good agreement with those in the TCGA discovery cohort.
Expression of the four miRNAs in an independent GEO dataset and clinical specimens The expression of miR-10b, miR-130a, miR-561 and miR-4684 was identi ed in an independent GEO dataset (GSE98406). We also studied the miRNAs expression of tumor tissues compared to non-cancerous tissues by qRT-PCR in our clinical specimens. Revealing a consistently different expression with the TCGA database, the four miRNAs expression of tumor tissues was signi cantly upregulated both in GSE98406 dataset and our clinical specimens ( Fig. 6A-B).

The diagnostic capacity of the four circulating miRNAs
The circulating miR-10b, miR-130a, miR-561 and miR-4684 expression in the preoperative and postoperative serums of CRC patients was tested by qRT-PCR to investigate whether circulating miRNAs in the serum were tumor derived. The expression of circulating miRNAs was signi cantly decreased in postoperative serums compared with that in preoperative serums, suggesting that CRC tumor was the source of the increased circulating miRNAs in serum (Fig. 7A). The relative levels of miR-10b, miR-130a, miR-561 and miR-4684 were signi cantly higher in the CRC serums than those in the healthy individuals (Fig. 7B). In order to assess the potential diagnostic capacity of the four circulating miRNAs, ROC analysis was carried out and AUC was calculated.  (Fig. 7C).

Function of the four miRNAs
To analyze the potential roles of the four miRNAs in CRC, their downstream target mRNAs were predicted by using mirDIP database. mirDIP database integrates the data of 30 databases which can predict the target genes of miRNAs. We selected 100 potential target genes with the highest score for each miRNA (Table 5) and Metascape database was used to carry out the GO and KEGG enrichment analyses. According to the top 20 signi cant pathways and functions of which p-value < 0.05, a heatmap enriched was constructed (Fig. 8A). Several biological processes and pathways such as chromatin remodeling, mitotic cell cycle phase transition, proteoglycans and transcriptional misregulation in cancer were related to the progression of cancer. According the correlated function pathway and constructed network, the highly enriched genes were clustered. In the diagram, the pathway or biological process which contains the greater number of genes was represented by the dark color (Fig. 8B) and different categories were represented by different colors (Fig. 8C). Last, the PPI network for the potential target genes was built (Fig. 8D) and the MCODE networks were gathered (Fig. 8E-F).
Cancer cells derive some molecules that contain signature markers of their origin cells into the peripheral blood, which is the theoretical basis of the concept of liquid biopsy [26,27]. The blood-based screening has the advantages of minimal invasiveness and good reproducibility. However, there are few routine liquid biopsy markers for CRC screening or diagnosis. Low sensitivity and speci city limited the application of carcinoembryonic antigen (CEA) in the CRC diagnosis or screening, so CEA was commonly to be monitor for CRC recurrence [28]. To develop more sensitive liquid biopsy biomarkers used as a replacement for or in combination with current stool-based screening is crucial to early detection of CRC and decrease of CRC-related mortality. Being detected in serum and tissue and remaining stable after long storage [29,30], miRNAs have been considered as suitable potential markers for liquid biopsy [16]. We also assessed the potential diagnostic capacity of above four miRNAs in independent clinical samples in the present study. The expression of circulating miR-10b, miR-130a, miR-561, miR-4684 in CRC serums was signi cantly higher than that in healthy subjects. The integrated four-miRNA signature exhibited high accuracy in distinguishing CRC patients from healthy controls.
To explore the potential biological function of the four miRNAs, we predicted the potential target mRNAs of each miRNA and performed the gene ontology as well as KEGG pathway analysis. The results suggested that these four miRNAs were potentially involved in several cancer-related biological processes and pathways such as chromatin remodeling, mitotic cell cycle phase transition, proteoglycans and transcriptional misregulation in cancer. Transcriptional regulation is essential to intra-and extra-cellular signals responding, to de ne and maintain cell identity, and to coordinate cellular activity [31]. It is well known that most cancers are characterized by transcriptional dysregulation. Chromatin remodeling is mainly involved in the regulation of gene transcription, and is related to cell apoptosis, DNA damage repair, cell proliferation and differentiation, and maintenance of genomic stability [32]. Abnormal chromatin remodeling is closely related to the occurrence of cancers [33]. The cell cycle is strictly ordered under the regulation of control mechanisms. Accurate cell cycle phase transition is crucial to genome duplication and chromosome segregation [34]. Misregulation of cell cycle phase such as G1/S, G2/M or S/G2 transitions may lead to tumorigenesis. Some miRNAs were reported to be associated with above biological processes in cancers. miR-661 and miR-22 were reported to regulate chromatin remodeling [35,36], miR-122, miR-15a, miR-26, miR-29, mir-30 and let-7 were related to the transcriptional misregulation in cancer [37], miR-188, miR-638 and miR-1258 were involved in the cell cycle regulation [38][39][40].
Some mechanisms of miR-10b and miR-130a involved in CRC have been reported. miR-10b signi cantly inhibited PIK3CA expression and suppressed the activity of PI3K/Akt/mTOR pathway, increased TGF-β and SM α-actin expression, promoted cancer associated broblasts (CAFs) formation and CRC growth [41]. miR-10b may target HOXD10 ,E-cadherin, KLF4 and Rhoc, promoting the invasion and migration in CRC tumor and up-regulation of miR-10b was associated with liver metastasis in CRC [42][43][44][45][46]. miR-130a was also reported to promote proliferation, migration and invasion of CRC cells by regulating FOXF2 [47]. miR-130a was a potential therapeutic target of CRC. miR-130a functioned as a radiosensitizer in rectal cancer [48], and curcumin and mesalazine suppressed the colon cancer proliferation by inhibiting the expression of miR-130a [49,50]. To our knowledge, this study is the rst to clarify the speci c expression of miR-561 and miR-4684 in the tumors & serums of CRC patients and to identify their prognostic and diagnostic capacity in CRC. The molecular mechanisms of miR-561 and miR-4684 involved in CRC completely unknown.
We should point out the limitations of the present study. First, the prognostic capacity and diagnostic capacity of the four-miRNA signature were only separately evaluated in TCGA dataset and a clinical cohort. The effectiveness of this signature should be further validated in different cohorts. Secondly, there remains a need for the in-depth research of the miRNAs biological functions to explore the novel mechanisms of CRC carcinogenesis.
We established a novel integrated four-miRNA signature which was signi cantly related to OS in CRC patients. This four-miRNA signature could accurately distinguish low prognostic risk patients from high prognostic risk patients. Moreover, this four-miRNA signature exhibited a good accuracy and reliability in identifying CRC patients from healthy controls. These results suggested that this four-miRNA signature could be a potential prognostic and diagnostic model for CRC. The main owchart of this study. GSE113486 dataset contains serum samples from 40 CRC patients and 100 non-cancer controls, GSE124158 dataset contains serum samples from 30 CRC patients and 275 healthy controls, GSE106817 dataset contains serum samples from 115 CRC patients and 2759 healthy controls, colorectal tumor and matched non-cancerous tissues samples from 11 patients were analyzed in TCGA database. 569 CRC patients with survival information in TCGA database were randomly divided into the TCGA discovery cohort (n =285) and TCGA validation cohort (n = 284). There were 52 CRC patients and 52 healthy controls in the clinical set.

Figure 3
Construction of the prognostic signature based on the TCGA discovery cohort. A. LASSO coe cient pro les of the 12 miRNAs with hazard ratio (HR) > 1. B.
Cross-validation to select the optimal tuning parameter (λ). C. HRs and 95% CIs of the four miRNAs based on multivariate Cox regression analysis of the TCGA discovery cohort. D. The distribution of risk scores. E. The distribution of OS and OS status in the high-and low-risk groups. F. Heatmap of the four prognostic miRNAs expression pattern between high-and low-risk groups. G. Kaplan-Meier survival plots of high-or low-risk patients in the TCGA discovery cohort.   Expression of the four miRNAs in an independent GEO dataset and clinical specimens. A-B. miR-10b, miR-130a, miR-561 and miR-4684 were signi cantly upregulated in CRC tumor tissues. NC: non-cancerous. * p < 0.05; **p < 0.01.