A novel seven-microRNA signature predicts prognosis in colorectal adenocarcinoma

Background Colorectal carcinoma, primarily colorectal adenocarcinoma (CRA), is one of the most common malignancies, ranking third, while contributing the second cause of cancer death in the world. MicroRNAs, a type of non-coding RNA, play an important role in regulating cancer-related cell biology. To simply and accurately predict survival risk for CRA patients, we identied a novel seven-miRNA signature. The microRNA (miRNA) expression proles along with clinical data of 512 CRA patients were downloaded from The Cancer Genome Atlas (TCGA) database, and 415 patients with complete clinical information were further divided into training set and test set randomly. To construct the prognostic signature in the training set, a series of Cox regression analyses were performed, including univariate regression, least absolute shrinkage and selectionator operator (LASSO) - Cox regression, and multivariate regression. Results Seven predictive miRNAs (miR-153-2, miR-3199-2, miR-144, miR-887, miR-561, miR-3684, and miR-505) were ultimately screened. The ROC curves for 5-year survival in the training set, test set and entire set were 0.889, 0.742, and 0.816, respectively. Kaplan-Meier analysis results of the three sets all showed p <0.05. Further analyses demonstrated that the signature was an independent prognostic risk factor for CRA patients, and its predictive ability was superior to age and TNM stage. Functional enrichment analysis revealed that p53 and ErbB pathways were related to the prognostic regulation of miRNAs in the signature in CRA patients. Conclusion Our study demonstrated the potential of this novel seven-miRNA signature to independently predict overall survival in patients with CRA. Functional enrichment analysis further revealed the possible regulatory role of miRNAs in the signature in CRA-related cell biological processes and signaling pathways. enrichment analysis results suggested that the miRNAs in the seven-miRNA signature may be related to a variety of processes and pathways related to tumorigenesis and progression, further supporting the power of the seven-miRNA signature for predicting the prognosis of patients with CRA.

entire set were 0.889, 0.742, and 0.816, respectively. Kaplan-Meier analysis results of the three sets all showed p <0.05. Further analyses demonstrated that the signature was an independent prognostic risk factor for CRA patients, and its predictive ability was superior to age and TNM stage. Functional enrichment analysis revealed that p53 and ErbB pathways were related to the prognostic regulation of miRNAs in the signature in CRA patients.
Conclusion Our study demonstrated the potential of this novel seven-miRNA signature to independently predict overall survival in patients with CRA. Functional enrichment analysis further revealed the possible regulatory role of miRNAs in the signature in CRA-related cell biological processes and signaling pathways.

Background
Colorectal adenocarcinoma (CRA) is the main form of colorectal cancer (CRC), accounting for more than 95% of CRCs, and has high morbidity and mortality. 1 Due to its high heterogeneity, traditional predictors such as age and TNM stage are not su cient to accurately predict the survival risk of CRA patients. Exploring novel biomarkers is necessary to provide effective and personalized predictive tools. For the past few years, investigators have carried out a series of explorations in this eld, and several prognostic gene signatures, 2,3 transcriptional signatures and noncoding RNA signatures have been published. [4][5][6] However, there is still no recognized prognostic prediction model, and further research is required.
MicroRNAs (miRNAs) are a class of endogenous non-coding single-stranded RNA molecules about 20~25 nucleotides with regulatory functions, and participates in a series of important processes in life, including early development, cell proliferation, apoptosis, cell death, fat metabolism and cell differentiation. Many miRNA expression pro les related to particular malignancies have been found to have tumor-suppressive or oncogenic roles in different cancer types and further affect the prognosis of patients. Moreover, the functions of miRNAs are involved in the occurrence, development and metastasis of tumors. 7 For instance, Mirzaei et al reported that miR-29b has signi cant tumor-suppressive effects on chronic lymphocytic leukemia (CLL). 8 In addition, Zhou Y et al discovered that miR-130a acts as an oncogenic miRNA in gastric cancer. 9 In our study, a seven-miRNA prognostic signature was developed through data mining of The Cancer Genome Atlas (TCGA) database. The prognostic model can identify high-risk patients with lower survival rates to allow more effective targeted therapy to be offered. Functional enrichment analysis clari ed that the screened miRNAs participated in many CRA-related cell biological processes and signaling pathways.
Our ndings not only demonstrated the potential power of the seven-miRNA prognostic model for predicting the survival of CRA patients, but also provide new ideas for targeted therapy in patients with CRA..

Methods
Acquisition and processing of the data sets MiRNA expression data along with clinical information for 521 CRA patients were downloaded from the TCGA database at http://cancergenome.nih.gov/ (up to March 2020). From the miRNA expression data, the miRNAs with minimal expression levels (<1) were rst ltered out, and the data were then processed via a standard procedure using the "edgeR" package (version 3.6.2). Finally, 385 differentially expressed miRNAs were identi ed between cancer tissues and normal tissues in CRA patients; a |log2FC| of >1 and an adjusted P-value of <0.05 were used as cutoff criteria. The heatmap of the top signi cantly differentially expressed miRNAs was generated with the "ggplot2" package (version 3.6.3). Patients with incomplete clinical information (including sex, age, TNM stage, survival status and survival time) were excluded from further analysis. After the above exclusion steps, 415 CRA patients were enrolled in and divided into training set (n = 208) and test set (n = 207) randomly.
Screening the prognosis-related miRNAs and modeling in the training set First, the regression analysis of univariate Cox proportional hazards was performed in the training set to analyze the correlation between the expression level of each miRNA and the patient's survival time and status via the R package "survival". Second, least absolute shrinkage and selectionator operator (LASSO) -Cox regression was conducted on the miRNAs with a P-value of <0.05 in the rst analysis. LASSO is a compression estimate method that can produce a more re ned model by constructing a penalty function. In this study, we used the R package "glmnet". To determine the value of penalty parameter λ, crossvalidation was carried out rst, and the value of λ resulting in the minimum mean cross-validated error was selected. 10 Third, the selected miRNAs were included in the multivariate Cox regression analysis to calculate their respective coe cients. Last, the following seven-miRNA prognostic signature was constructed: seven-miRNA signature= (βi * Expi), where i is the name of the prognostic miRNA and β is the coe cient of that miRNA in the third analysis. MiRNAs with a positive coe cient were considered as protective miRNA, while those with a negative coe cient were considered as oncogenic miRNA.
E cacy veri cation of the seven-miRNA signature in the three data sets To further assess the e cacy of the seven-miRNA signature, we calculated the score of each patient in the three data sets (the training set, the test set and the entire set) based on this seven-miRNA signature. According to the median, each set was divided into a high-risk group and a low-risk group, and then the Kaplan-Meier survival analysis was applied to compare the overall survival difference between the highrisk group and the low-risk group using the R package "survival". To verify the e cacy of the seven-miRNA signature, time-dependent receiver operating characteristic (ROC) curves and the areas under the ROC curves (AUCs) were generated via the R package "timeROC".
Combination of clinical factors to evaluate the e cacy of the seven-miRNA signature In the entire data set, the seven-miRNA signature combining clinical factors (including age, gender, race, tumor site and TNM stage) was analyzed by univariate Cox regression to identify associations between these miRNAs and overall patient survival. The variables with a P-value of <0.05 were included in further horizontal and vertical comparisons. ROC curve analysis was performed with the R packages "plotROC" and "ggplot2" to horizontally compare the seven-miRNA signature with clinical factors related to the prognosis of CRA. Kaplan-Meier survival curves were used for strati ed longitudinal analysis.

MiRNA target gene prediction and functional enrichment analysis
Target genes of the seven selected miRNAs were predicted via the following three miRNA databases: miRTarBase (http://mirtarbase.mbc.nctu.edu.tw/, version: 7.0), TargetScan (http://www.targetscan.org/, version: Human 7.2) and miRDB (http://mirdb.org/). The intersection of the results obtained from the three databases was considered the set of miRNA-related genes. Functional enrichment analyses of these miRNA-related genes were performed on the Database for Annotation, Visualization, and Integrated Discovery (DAVID, version 6.8) at https://david.ncifcrf.gov/, which could conduct Gene Ontology (GO) biological process (BP) enrichment analysis and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis. The results with a P-value of < 0.05 were visualized using the R packages "Cairo" (version 3.6.3) and "ggplot2" (version 3.6.3).

Patient data
MiRNA expression les and clinical information for 521 CRA patients (comprising 529 tumor samples and 11 normal tissues) were downloaded from the TCGA database. A total of 415 CRA patients with complete clinical data were enrolled in further analysis. All enrolled patients had primary adenocarcinoma, did not have a previous or concurrent malignancy, and received no chemotherapy or radiotherapy before surgery. After differential miRNA expression analysis, the 415 CRA patients were randomly divided into two sets (Table 1).

Differentially expressed miRNAs between cancer tissues and normal tissues in CRA patients
Before analysis, data normalization and log2 transformation were conducted in the miRNA expression les rstly using R software (version 3.6.1). We nally identi ed 385 differentially expressed (231 upregulated and 154 downregulated) miRNAs (|log2FC| >1 and the adjusted P-value <0.05) using the "DESeq" package. The heatmap of the expression of differentially expressed miRNAs in Figure 1 could clearly distinguish the CRA samples from the controlling normal samples.
Signature identi cation of the prognosis-related miRNA in the training set Univariate Cox regression analysis with the LASSO method was used in the training set for selection of CRA prognosis-related miRNAs among the 385 differentially expressed miRNAs ( Figure 2). Then, through a stepwise multivariate Cox regression analysis to maximize accuracy and validity, seven miRNAs were nally screened, and the prognosis-related miRNA signature was constructed based on the expression levels of these seven miRNAs. The signature formula was as follows: seven-miRNA signature= hsa-miR-144×−0.255 + hsa-miR-153-2×0.472 + hsa-miR-505×0.475 + hsa-miR-887×0.565 + hsa-miR-3199-2×0.509+hsa-miR-561×0.452 + hsa-miR-3684×0.516. The higher concordance index (C index), the better model performance, and the C index of the seven-miRNA signature was 0.772.
Among these miRNAs, hsa-miR-144 had a negative coe cient in the risk formula, indicating that it might be considered a protective miRNA; that is, the higher the expression level of this miRNA, the longer was the survival time of the patient. The remaining six miRNAs in this risk formula had positive coe cients, pointing that higher expression levels of these miRNAs were associated with shorter patient survival times.
Evaluation of the survival prediction capability of the seven-miRNA signature in the training set The 208 samples in the training set were scored according to the risk scoring formula and divided into the high-risk group (n=104) and the low-risk group (n=104), with a median value of 1.009 as the cutoff point. The Kaplan-Meier overall survival curves of the two groups showed statistically signi cant differences between patient survival (log-rank test P=5e−08) ( Figure 3A). ROC curves along with AUC values were further generated to evaluate the prognostic ability of the seven-miRNA signature ( Figure 3B). The AUCs for 3-and 5-year survival in the training set were 0.769 and 0.889, respectively, indicating that the signature had good performance in predicting CRA patients' survival risk. In addition, the distributions of the risk score, survival status and expression pro le of the seven miRNAs in the training set were illustrated to visualize the analysis results ( Figure 4). Patients with higher risk scores had poorer overall survival than patients with lower risk scores, as shown in the gure. The expression levels of the protective miRNA hsa-miR-144 were signi cantly decreased in patients with high risk scores, whereas those of the remaining six oncogenic miRNAs (miR-153-2, miR-505, miR-887, miR-3199-2, miR-561, and miR-3684) were increased.
Performance veri cation of the seven-miRNA signature in the test set and entire set To evaluate the performance of the seven-miRNA signature for predicting the survival of CRA patients, Kaplan-Meier survival analysis was conducted and ROC curves with AUC values were generated in the test set (n= 207) and the entire set (n= 415). Patients in the two validation sets were classi ed into a highrisk group and a low-risk group according to the constructed seven-miRNA signature. As shown in the Kaplan-Meier survival curve for the test set, patients in the high-risk group (n= 103) had signi cantly shorter overall survival times than those in the low-risk group (n= 104) (P = 6e -03, Figure 5A). In the test set, the AUC for 3-year survival was 0.666, and that for 5-year survival was 0.742 ( Figure 5B). Similar results were obtained in Kaplan-Meier survival analysis of the entire TCGA set, in which the patients in the high-risk group (n= 207) had signi cantly shorter overall survival times than those in the low-risk group (n= 208) (P= 0, Figure 6A). In the entire TCGA set, the AUC for 3-year survival was 0.722, while that for 5year survival was 0.816 ( Figure 6B). These results con rmed the power of the seven-miRNA signature for predicting the prognosis of CRA patients.
To better demonstrate differences in the survival outcomes and miRNA expression levels between CRA patients in the high-risk and low-risk groups, we visualized the distributions of the risk scores, survival statuses and miRNA expression levels of patients in the test set and the entire TCGA set. The overall survival times of patients in the high-risk group was signi cantly shorter than those of patients in the lowrisk group; moreover, one protective miRNA, hsa-mie-144, was highly expressed in the low-risk group, while the remaining 6 oncogenic miRNAs were highly expressed in the high-risk group, recon rming the results in the training set. (Figure 5C, Figure 6C).
Combination of routine clinical factors for e cacy evaluation of the seven-miRNA signature Clinical factors play an important role in tumor prognosis. To construct a more stable and effective prognostic model, the seven-miRNA signature combined with conventional clinical parameters (including age, race, sex, tumor site and TNM stage) was analyzed by univariate and multivariate Cox regression along with strati cation analysis to evaluate the associations of these parameters with the overall survival in the entire data set. Variables with a P-value of <0.05-i.e., the seven-miRNA signature, age and TNM stage-were considered to be associated with prognosis, as shown in Table 2. In addition, the conclusion can be made that the seven-miRNA signature is an independent risk factor for the prognosis of CRA patients.
Then, ROC curves and AUC values were used to horizontally compare the e cacy of the seven-miRNA signature with those of patient age and TNM stage in predicting the prognosis of patients with CRA. First, the seven-miRNA signature, age, and a new variable combining both of these variables were included in the analysis. As shown in Figure 7A, the seven-miRNA signature showed a higher predictive power than the age. Then, the seven-miRNA signature, TNM stage, and a new variable combining both of these variables were included; similarly, the predictive power of the seven-miRNA signature was superior to that of the TNM stage ( Figure 7B).
Finally, analysis was performed on the 415 patients strati ed by age and TNM stage to compare the risk score of the seven-miRNA signature. The patients were divided by the average age in years into the younger group (<65, n=189) and the aged group (≥65, n=226. In the younger group, the seven-miRNA signature divided the patients into a high-risk group (n=95) and a low-risk group (n=94) according to the median risk score. Kaplan-Meier survival curves showed that the difference between the two risk groups was statistically signi cant (P=5.891e−05, Figure 8A&B). The corresponding values were also statistically signi cant (P=1e−07) in the aged group, which was also divided into the high-and low-risk groups (n=112 and 114, respectively). Then, the patients were classi ed by TNM stage into the early-stage group (stages I+II, n=229) and the advanced-stage group (stages III+IV, n=186). In both stage groups, the seven-miRNA signature separated the patients into a high-risk group and a low-risk group independently. Similarly, the Kaplan-Meier survival curves for both stage groups indicated that the overall survival times of patients in the high-risk groups were signi cantly shorter than those of patients in the low-risk groups (P=5.891e−05, 0.00011855, Figure 8C&D).

Target gene prediction and functional enrichment analysis of the prognostic miRNAs
To investigate the cellular processes and biological pathways associated with the seven prognostic miRNAs, miRNA target prediction was separately performed with miRDB, miRTarBase7.0 and TargetScan.
MiRDB identi ed 9297 miRNA-related genes, 11 while TargetScan, 12 the most comprehensive miRNA target gene database, identi ed 38678 mRNAs. MiRTarBase, 13 the experimentally validated miRNA-target interaction database, identi ed 152 mRNAs associated with the seven prognostic miRNAs. The intersection of the results from these three databases was identi ed as the set of miRNA-related genes (n=67, Figure 9B) for further functional enrichment analyses. First, GO BP and KEGG pathway enrichment analyses were performed with these miRNA-related genes via DAVID. 14,15 Then, the genes with a P-value of < 0.05 (Table 3) were visualized using the R packages "Cairo" and "ggplot2" ( Figure 9A&C). The GO and KEGG analysis results showed that these genes are involved in cellular processes associated with cancer, such as regulation and negative regulation of apoptosis, and in biological pathways such as the p53 signaling pathway. The functional enrichment analysis results indicated the potential roles of the seven prognostic miRNAs in the regulation of CRA tumorigenesis and progression via targeting of their associated mRNAs.

Discussion
CRC is the world's third most common cancer, 16 accounting for approximately 10% of all new cancer cases annually worldwide. 17 CRA is the most common histological type of CRC. Although CRC screening for earlier detection has somewhat reduced morbidity and mortality in recent years, the ve-year survival rate of CRA patients remains unsatisfactory, 18 and the molecular mechanisms underlying CRA progression are still elusive. CRC is a highly heterogeneous tumor, and few studies have focused speci cally on the prognosis of CRA. Developing novel predictive biomarkers is important and needed for obtaining a better understanding of CRA and providing diagnostic and therapeutic strategies. In this study, we conducted a series of analyses on miRNA expression and corresponding clinical data for training and validation set generated from 415 CRA patients in the TCGA database, with an aim to develop a meaningful predictive model related to CRA prognosis.
MiRNAs, an important subset of noncoding RNAs, have been recognized to play multifaceted roles in controlling cellular functions by repressing their target genes. 19 For example, let-7 family members can regulate cancer stem cells by targeting HRAS and HMGA2, 20 and miR-21, which is highly expressed in breast cancer, is correlated with poor patient survival. 21 Virtually all CRCs exhibit altered miRNA expression; 22 for instance, miR-22 contributes to tumorigenesis in CRC, 23 and downregulation of miR-34 results in CRC metastasis via an increase in IL6 signaling pathway activity. 24 These ndings suggest the potential of miRNAs as excellent prognostic biomarkers for CRA survival prediction. The heat map generated in our study intuitively indicates that compared with the normal tissues, the CRA tissues exhibited obvious miRNA expression abnormalities, proving the previous statement by Okugawa that almost all CRCs exhibit altered miRNA expression.
Through a comprehensive analysis including Cox regression with the LASSO method on the training set, seven prognostic miRNAs were nally screened. Model evaluation is dependent on the values of the C index and AUC. The higher the C index and AUC values, the better is the model. 25 Our risk score formula, termed the seven-miRNA signature, was established with a C index of 0.772. The AUC for 5-year survival was 0.889 in the training set and 0.816 in the entire TCGA set, indicating that the signature had good predictive power. Several types of CRC prognostic score models have recently been developed and veri ed, including an immunoscore with a C index of 0.612 and AUC of 0.630 for 5-year survival, 26 a lncRNA score with an AUC of 0733 for 5-year survival, 27 and so on. However, prognostic models comprising miRNAs for use with CRA patients have not been reported, and our signature lled this gap.
Moreover, the performance of the seven-miRNA signatures was evaluated in the training set and veri ed in the test set and the entire set using ROC curve and Kaplan-Meier survival analysis. The seven-miRNA signature performed well in distinguishing whether patients with CRA had a high or low prognostic risk.
The heat map for the three data set clearly shows the expression patterns of the seven miRNAs in CRA samples. The expression of miR-144 was high in the low-risk group but low in the high-risk group, and the remaining 6 miRNAs exhibited the opposite pattern. These results indicate that miR-144 is the only negatively regulated miRNA with an antitumor effect; thus, miR-144 may be a protective factor for CRA and may become a new therapeutic target. This possibility needs further study. Subsequent GO analysis showed that several of the cancer-related cellular processes and biological pathways related to these genes, such as regulation and negative regulation of apoptosis, positive regulation of cell proliferation, and regulation of protein amino acid phosphorylation, are involved in tumorigenesis and tumor progression. KEGG analysis revealed that the p53 and ErbB signaling pathways were enriched.The p53 signaling has frequently been shown to be associated with poor prognosis in CRC, 28 and a signi cant proportion of CRCs have been reported to exhibit changes in ErbB signaling. 29 These results suggest the participation of the seven miRNAs in CRA development and prognosis through regulation of their target genes.
Furthermore, among the factors related to the prognosis of CRA patients, clinical parameters, such as TNM stage, 30 which is currently a prognostic indicator favored by clinicians, are essential. Cox regression analysis revealed three prognostic factors-age, TNM stage and the seven-miRNA signature, among which the seven-miRNA signature was an independent prognostic risk factor. To gain insight into the prognosis of patients with CRA, we performed a transverse ROC analysis and a strati ed longitudinal analysis on the entire TCGA set. The AUC of the seven-miRNA signature was superior to that of TNM stage and age for predicting prognosis. Then, the strati cation analysis results suggested that the seven-miRNA signature could accurately distinguish high-risk patients from low-risk patients strati ed by TNM stage and age, which again demonstrated the predictive power of the seven-miRNA signature.
To the best of our knowledge, this study is the rst to report the relationship between a seven-miRNA signature and survival in CRA. However, this study has some limitations. First, although the biological functions of the prognostic miRNAs were annotated through functional enrichment analysis, in vivo and in vitro studies are still needed to further reveal the mechanism by which these prognostic miRNAs mediate CRA tumor development and prognosis. Second, in the sample data obtained from the TCGA database, white ethnic groups were the most common, and data for Asian ethnic groups were scarce. Although our study showed that the prognosis of CRA patients was not correlated with the ethnic group, whether differences exist within ethnic groups is unknown. More data from Asian ethnic groups are needed for further investigation.

Conclusion
In conclusion, we performed scienti c data mining for miRNAs from a TCGA dataset of patients with CRA and identi ed a seven-miRNA signature with good predictive performance. The seven-miRNA signature was successfully validated in the test set and the entire set and was demonstrated to be an independent factor for survival prediction and superior to other conventional clinical variables. Functional enrichment analysis results suggested that the miRNAs in the seven-miRNA signature may be related to a variety of processes and pathways related to tumorigenesis and progression, further supporting the power of the seven-miRNA signature for predicting the prognosis of patients with CRA.

Authors' contributions
Shengying Jiang was involved in study designing, writing of this article, data processing, and preparing of gures. Xiaoli Xie was involved in data processing and writing of this article. Huiqing Jiang was involved in developing the idea and proofreading the manuscript.

Availability of data and materials
The datasets analyzed during the current study are available in The Cancer Genome Atlas (TCGA) repository, http://cancergenome.nih.gov/.    Figure 1 Heatmap generated by unsupervised hierarchical clustering of the differentially expressed miRNAs between cancer tissues and normal tissues in CRA patients.