Construction of a Pseudogene ‐ associated ceRNA Network and Identication of Prognostic Pseudogene Biomarkers for Colorectal Cancer

Colorectal cancer (CRC) is one of the most common and deadly malignant carcinomas. Many long noncoding RNAs (lncRNA) have been reported to play an important role in the tumorigenesis of CRC by interacting with miRNAs and inuencing the expression of some mRNAs through a competing endogenous RNA (ceRNA) network. Pseudogenes are one kind of lncRNA and can act as RNA sponges for miRNAs and regulate gene expression via ceRNA networks, but there are few studies about pseudogenes in CRC. In this study, total of 31 differentially expressed (DE) pseudogenes, 17 DE miRNAs and 152 DE mRNAs were identied by analyzing the expression proles of colon adenocarcinoma (COAD) obtained from The Cancer Genome Atlas (TCGA). And a ceRNA network was constructed based on these RNAs. Kaplan–Meier analysis showed that 7 pseudogenes, 4 miRNAs and 30 mRNAs were signicantly associated with overall survival. Then multivariate Cox regression analysis on the ceRNA-related DE pseudogenes was performed and a 5-pseudogene signature with the greatest prognostic value for CRC was identied. What’s more, the results were validated by the Gene Expression Omnibus (GEO) database, and quantitative real ‐ time PCR (qRT ‐ PCR) in 113 pairs of CRC tissues. In conclusion, this study provides a pseudogene-associated ceRNA network and 7 prognostic pseudogene biomarkers, and a 5-pseudogene prognostic risk signature that may be useful to predict the survival of CRC patients. The quantitative real-time polymerase chain reaction (qPCR) was done to detect the DE pseudogenes expression in both colon cell lines and clinical samples. As for cell lines, human normal colon broblastic cell line (CCD-18Co) was purchased from ATCC and cultured in MEM medium plus 15% fetal bovine serum (FBS) and 100 U/mL penicillin/streptomycin (Gibco, USA), and four human colon cancer cell lines (HCT116, SW480, RKO and LoVo) were grown in RPMI1640 medium with 10% FBS and 100 U/mL penicillin/streptomycin at 37°C in a humidied incubator with 5% CO2. 113 paired human CRC tumor and cutting edge tissues were collected and stored at -80°C in Peking University Cancer Hospital, China. Research protocols were approved by the Institutional Review Board of the Peking University Cancer Hospital and Institute. All patients in this study provided written informed consent. The total RNA was extracted using the Direct-zol™ RNA MiniPrep kit (ZYMO RESEARCH, USA) according to the manufacturer's instructions. Then complementary DNA (cDNA) was synthesized using TransScript First-Strand cDNA Synthesis SuperMix (TransGen Biotech, China). Next, reverse transcription-qPCR (RT ‐ qPCR) was performed using the FastStart Universal SYBR Green Master (ROX) (Roche, Germany) on an ABI-7500 Fast system (Applied Biosystems). And GAPDH was used as the endogenous reference gene for the cultured cell lines while ALU for tissues. The expression levels of the survival-associated pseudogenes in the ceRNA network were determined using the typical ΔΔCt method. The correlation with clinical pathological characteristics and survival were also analyzed with chi-square, Student’s t-test and Kaplan-Meier test in SPSS 20.0. P < 0.05 was considered statistically signicant. potential biomarkers for the diagnosis of CRC and prognosis prediction for CRC patients were discovered through identication of the 5-pseudogene signature and clinical analysis.


Introduction
According to the GLOBOCAN 2018 assessment on cancer incidence and mortality published by the International Agency for Research on Cancer, colorectal cancer (CRC) was classi ed with the third (10.2%) and second (9.2%) highest incidence and mortality rates respectively among all cancer types [1,2]. The molecular mechanisms for the development of CRC were clinically important as they were tightly correlated with the prognosis and treatment response of patients. Except for the traditional genetic and epigenetic alterations of protein-coding genes, non-coding RNAs were considered play important roles in regulating various biological behaviors, such as cell proliferation, metastasis, apoptosis, differentiation, etc [3]. There're a lot of discoveries that endogenous non-coding RNAs could act as sponges and competitively bind with RNAs in gene expression regulatory networks, which could in uence cell fate decisions in both developmental as well as pathological cancer contexts [4].
Competitive endogenous RNAs (ceRNAs) are transcripts that can regulate each other at post-transcription level by competing for shared miRNAs. CeRNA networks link the function of protein-coding mRNAs with that of non-coding RNAs such as microRNA, long non-coding RNA, pseudogenic RNA and circular RNA [5].
They play an important role in the occurrence and development of colorectal cancer [6]. However, among the several non-coding RNAs, pseudogenic RNA associated ceRNAs have been rarely studied in CRC.
Pseudogenes may derive from gene mutations, or unfaithful gene duplications, or retrotransposition of processed mRNAs back into the genome. Accordingly, pseudogenes can be categorized into three types: (1) unitary pseudogenes, (2) duplicated or unprocessed pseudogenes and (3) processed or retrotransposed pseudogenes [7,8]. There're an increasing number of researches showing that pseudogenes were involved in the occurrence and development of cancer through ceRNA networks. For example, pseudogene PTENP1 could be targeted by multiple PTEN-targeting miRNAs and then regulate the protein level of PTEN [9]. For example, miR-21 could downregulate the expression of PTENP1 pseudogene and PTEN gene and further affect the growth of hepatocellular carcinoma cells [10]. PTENP1 could bind with miR-200c and reverse its inhibition function to PTEN in the development of endometrioid endometrial carcinoma [11]. PTENP1 could also interact with miR-20a to suppress breast cancer progression [12]. There're also a few researches about how pseudogenes affect CRC tumorigenesis and development through the regulation of ceRNA networks. For example, pseudogene DUXAP8 could promote colon cancer cell proliferation, migration and invasion by targeting tumor suppressor miR-577 and promote the expression of oncogene RAB14 [13]. Pseudogene FLT1P1 could promote VEGFR1 and VEGF-A expression by interact with miR-520a, thus contribute to CRC cell growth [14].
In this study, we rstly comprehensively analyzed the aberrantly expressed pseudogenes, miRNAs and mRNAs of large-scale samples in the COAD dataset of TCGA and constructed the pseudogene-associated ceRNA network for CRC. We also discovered some novel pseudogenes and mRNAs that were signi cantly related to overall survival of patients with CRC, and identi ed a ve-pseudogene prognostic risk signature. More importantly, these results were validated in a dataset GSE50760 from GEO database and qRT-PCR experiment in our CRC samples.

TCGA Data collection and processing
The RNA-seq data of 521 samples and miRNA-seq data of 465 samples with colon adenocarcinoma were retrieved from the TCGA data portal (https://portal.gdc.cancer.gov/). R software and the package GDCRNATools were applied to read the RNA-seq sample sheet and remove the repetitive samples and the samples that were not primary tumor. Finally, 469 primary CRC tumors and 41 normal tissues in total were collected. The RNA-seq data contained more than 60,000 genes including non-coding genes with the Ensembl Gene ID. For miRNAs, a matrix of 451 primary tumors and 8 solid normal tissues was built with the expression level of all the genes. The miRNA-seq data included more than 2,500 miRNAs with annotated miRNA ID. Besides, the corresponding clinical information was also downloaded. The sample sheets provided the information of case ID, sample ID, sample type and clinical information such as race, ages, gender, pathologic stage, vital status, days to death or days to last follow up of the patients. This study was in accordance with the publication guidelines provided by TCGA (https://cancergenome.nih.gov/publications/publicationguidelines).

Identi cation of differentially expressed (DE) genes
The edgeR package was used to gure out the DE pseudogenes, miRNAs and mRNAs between primary tumors and normal tissues, with the threshold setting at an adjusted P-value < 0.01 and |log2-fold change (FC)|≥1. And heat maps and volcano maps of the DE pseudogenes, miRNAs and mRNAs were also generated using the gdcHeatmap and gdcVolcanoPlot3 of GDCRNATools package in the R platform.

Construction of the ceRNA network
The miRcode database (http://www.mircode.org/) was used to predict the interactions between pseudogenes and miRNAs in COAD dataset of TCGA. The miRNAs targeted mRNAs were retrieved using miRTarBase, miRDB and TargetScan databases, only the miRNA-targeted mRNAs presenting in all three databases were included to construct the ceRNA network. Cytoscape 3.6.1 software (https://cytoscape.org/) was used to visualize the ceRNA network. The Sankey diagram was constructed by using "ggalluvial" and "ggplot2" packages in R software to show the interactions between the survivalassociated pseudogenes and mRNAs, along with their matched miRNAs in the ceRNA network.

Identi cation of a 5-pseudogene prognostic risk signature
The 453 CRC patients were randomly divided into a training cohort (n = 227) and a validation cohort (n = 226). Multivariate Cox regression analysis was performed to identify the prognostic model for the pseudogenes in the ceRNA network, and the risk score of the patients with CRC was calculated according to the expression level of the involved pseudogenes weighted by the regression coe cient (βpseudogenes), as follows: Risk score = expression of pseudogene1 * β1pseudogene1 + expression of pseudogene2 * β2pseudogene2 + ··· expression of pseudogeneN * βNpseudogeneN. The pseudogene prognostic model was constructed based on the training cohort and then validated in the validation cohort. According to the risk score of the prognostic model, the CRC patients were divided into two groups of low-risk and high-risk by the median risk score value. Then the Kaplan-Meier analysis was conducted by R package "survival" to generate the overall survival (OS) curve for the two groups. And ROC curve analysis was conducted by package "survival ROC" to evaluate the accuracy of the prognostic model of 1, 3, and 5-year survival. Besides, a risk heatmap for the pseudogenes involved in the ceRNA network of the patients with CRC was plotted by R package "pheatmap" combining the gene expression and clinical survival data.

Survival analysis and correlation analysis
The R package "survival" was used for survival analysis for the DE RNAs involved in the ceRNA network and plotting Kaplan-Meier curves. The R package "ggcorrplot" was applied to make Pearson correlation coe cient analysis between the survival-associated mRNAs and pseudogenes in the ceRNA network. The survival-associated pseudogenes were analyzed with clinical pathological characteristics of CRC patients using chi-square test and t-test in SPSS 20.0 software. P < 0.05 was considered statistically signi cantly.

Validation with a dataset in GEO
The dataset GSE50760 was downloaded from the Gene Expression Omnibus (GEO, https://www.ncbi.nlm.nih.gov/geo/) database to validate the expression level of the survival-related genes in CRC and normal tissues. The GSE50760 dataset is an expression pro ling by high throughput sequencing, containing RNA-seq data of 54 samples (normal colon, primary CRC, and liver metastasis) generated from 18 CRC patients [15]. Student's t-test was conducted for normally distributed data while the Mann-Whitney U-test was conducted for nonnormally distributed data in SPSS 20.0 software, with statistical signi cance assigned at P < 0.05.

Quantitative real-time polymerase chain reaction validation
The quantitative real-time polymerase chain reaction (qPCR) was done to detect the DE pseudogenes expression in both colon cell lines and clinical samples. As for cell lines, human normal colon broblastic cell line (CCD-18Co) was purchased from ATCC and cultured in MEM medium plus 15% fetal bovine serum (FBS) and 100 U/mL penicillin/streptomycin (Gibco, USA), and four human colon cancer cell lines (HCT116, SW480, RKO and LoVo) were grown in RPMI1640 medium with 10% FBS and 100 U/mL penicillin/streptomycin at 37°C in a humidi ed incubator with 5% CO2. 113 paired human CRC tumor and cutting edge tissues were collected and stored at -80°C in Peking University Cancer Hospital, China. Research protocols were approved by the Institutional Review Board of the Peking University Cancer Hospital and Institute. All patients in this study provided written informed consent.
The total RNA was extracted using the Direct-zol™ RNA MiniPrep kit (ZYMO RESEARCH, USA) according to the manufacturer's instructions. Then complementary DNA (cDNA) was synthesized using TransScript First-Strand cDNA Synthesis SuperMix (TransGen Biotech, China). Next, reverse transcription-qPCR (RT-qPCR) was performed using the FastStart Universal SYBR Green Master (ROX) (Roche, Germany) on an ABI-7500 Fast system (Applied Biosystems). And GAPDH was used as the endogenous reference gene for the cultured cell lines while ALU for tissues. The expression levels of the survival-associated pseudogenes in the ceRNA network were determined using the typical ΔΔCt method. The correlation with clinical pathological characteristics and survival were also analyzed with chi-square, Student's t-test and Kaplan-Meier test in SPSS 20.0. P < 0.05 was considered statistically signi cant.
The following primer sequences were used in this study: DDX12P, forward,

Functional annotation and pathway analysis for DE mRNAs
The DE mRNAs involved in the ceRNA network were input into WebGestalt (http://www.webgestalt.org/), a functional enrichment analysis web tool to study the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway and gene ontology (GO) function of these genes. Then R package "dplyr" and "ggplot2" was used to analyze the results data and plotted the maps of the top 10 most signi cant items of the

Construction of the ceRNA network in CRC
In order to better understand the interactions among these DE pseudogenes, DE miRNAs and DE mRNAs in CRC, we constructed a pseudogene-miRNA-mRNA related ceRNA regulatory network. First, we found 31 of 74 DE pseudogenes could be targeted by 186 miRNAs by the prediction of miRcode database. And of the 186 targeted miRNAs there're only 17 miRNAs that overlapped with the 340 DEmiRNAs ( Fig. 2A). So the 17 miRNAs were selected to predict their target mRNAs through miRTargetBase, miRDB and TargetScan databases. There were 430 mRNAs, 844 mRNAs and 854 mRNAs could be targeted by the 17 miRNAs and overlapped with the DE mRNAs respectively in the 3 databases. In order to improve the validity of the targeted DE mRNA, only 152 mRNAs that presented in all the three databases were selected to construct the ceRNA network (As shown in Fig. 2B). Finally, we incorporated the 31 pseudogenes, 17 miRNAs and 152 mRNAs to build the ceRNA network using the Cytoscape software, and the visualized map was shown in Fig. 2C. The DE pseudogenes, miRNAs and mRNAs were listed in Supplemental table1.

Identi cation of survival-related DE pseudogenes in the ceRNA network
In order to explore the prognostic value of the DE pseudogenes, DEmiRNAs and DEmRNAs involved in the ceRNA network of CRC, we conducted Kaplan-Meier curve analysis using R software for the CRC patients from TCGA database. As shown in the results, 7 of the 31 pseudogenes had signi cant relationship with overall survival (p < 0.05). Among these 7 pseudogenes DDX12P, NCF1C, FER1L4, NSUN5P2, PLEKHA8P1 and RP9P were negative with overall survival, while GVINP1 was positive with overall survival (Fig. 3).
Correlation analysis for the expression of these genes with clinicopathological factors of patients in TCGA-COAD showed that DDX12P, FER1L4, GVINP1, PLEKHA8P1 and RP9P were related with the T, N, M or Pathologic stage (Supplemental table2).
In addition, there were 4 of 17 DE miRNAs and 30 of 152 DE mRNAs in the above ceRNA network were signi cantly correlated with overall survival of patients with CRC (Supplemental table1). The Pearson correlation coe cient analysis between the survival-related pseudogenes and mRNAs was carried out by using R package "ggcorrplot", as shown in Fig. 4A. Furthermore, we constructed a sankey diagram using the R package "ggalluvial" to visualize the interaction network among the 7 prognosis-related pseudogenes and the 30 prognosis-related mRNAs through binding with 16 DE miRNAs (Fig. 4B).

Construction of the 5-pseudogene prognostic risk signature
To better understand the prognostic value of the aberrantly expressed pseudogenes in CRC, we calculated the risk scores of DE pseudogenes through multivariate Cox regression analysis based on TCGA samples. The total 453 CRC patients were randomly divided into a training cohort (n = 227) and a validation cohort (n = 226), and no signi cant differences was found between the two groups (Supplemental table 3).
There were ve pseudogenes (NCF1C, YWHAZP4, RP9P, DDX12P and PLEKHA8P1) displaying a remarkable prognostic value in the multivariate Cox regression model in the training cohort (Supplemental table 4), only YWHAZP4 was not found to be prognosis-related pseudogenes. The risk scores were calculated using the formula as follows: risk score =(0.002045 × expression level of DDX12P) + (0.003879× expression level of NCF1C) + (0.003856× expression level of PLEKHA8P1) + (0.001913× expression level of RP9P) -(0.006358× expression level of YWHAZP4). All the CRC patients were ranked by the risk score and divided into low-risk (n = 114) and high-risk (n = 113) groups. The Kaplan-Meier curve showed signi cant poorer prognosis in the high-risk group compared with low-risk group with p = 0.0059 (Fig. 5A). And the AUC curve was used to evaluate the e cacy to predict 1, 3, and 5year survival of the CRC patients, and they're 0.632, 0.672 and 0.652 in the training cohort while 0,554, 0.644 and 0.785 in the validation cohort respectively (Fig. 5D). According to the risk score heatmap, the expression level of NCF1C, RP9P, DDX12P and PLEKHA8P1 were upregulated while the expression level of YWHAZP4 decreased with the increase of risk scores (Fig. 5B). The results of validation cohort were consistent with the results of training cohort (Fig. 5 right row), which suggested the e ciency of this 5pseudogene prognostic risk signature.

Validation of the DE pseudogenes through a GEO dataset and our CRC samples
In order to validate the expression alteration of the seven DE pseudogenes between normal and primary CRC tissues, we analyzed the differentially expressed 7 pseudogenes in a GEO dataset, CRC samples from our hospital and colon cell lines at the same time. GSE50760 dataset was downloaded from GEO database which contained the expression pro ling of high throughput sequencing of 18 paired normal colon tissue and primary colorectal cancer. All the seven pseudogenes' differentially expressions were con rmed in 113 pairs of CRC samples from Peking University Cancer Hospital. Five of the seven DE pseudogenes were also validated in GSE50760 and ve colon cell lines (Fig. 6).

The function and pathway enrichment analysis of DE mRNA
In order to explore the possible regulating mechanisms of the prognosis-related DE pseudogenes in CRC, the 152 DEmRNAs involved in the pseudogene-miRNA-mRNA related ceRNA network were further selected to do the GO annotation and KEGG pathway enrichment analysis to analyze the possible function and molecular pathways of these genes. The GO BP analysis showed that the DE mRNAs were mainly involved in regulation of cell differentiation, cell migration and locomotion and so on (Fig. 4A). The GO CC analysis showed that many of these mRNAs might be the component of receptor complex (Fig. 4B).
And the GO MF analysis suggested that these mRNAs probably played roles in regulating transcription of some genes because they were signi cantly associated with transcription regulatory region DNA binding, RNA polymerase II regulatory region DNA binding, regulatory region nucleic acid binding, DNA − binding transcription activator activity and so on (Fig. 4C). What's more, KEGG pathway analysis revealed that these mRNAs had a clear relationship with cancer, because they were enriched in the pathways such as microRNAs in cancer, EGFR tyrosine kinase inhibitor resistance, pathways in several kinds of cancers (Fig. 4D).

Discussion
Pseudogenes are abundant in human genome and conventionally considered as non-functional "junk genes." However, recent studies have revealed that pseudogenes are aberrantly expressed in cancers and may function in tumorigenesis through pseudogene-derived lncRNA transcripts. They may serve as ceRNAs through competitively binding to shared miRNAs, thus affect both their cognate genes and unrelated genes and play an important role in regulating ceRNA networks [17,18]. In order to deeply explore the underlying mechanisms for CRC carcinogenesis and development and investigate novel candidate biomarkers for CRC diagnosis and prognosis prediction, in this study, we identi ed aberrantly expressed pseudogenes, miRNAs and mRNAs between CRC tumor tissues and adjacent normal tissues and constructed a ceRNA network. This may be the rst report about the regulating network among pseudogenes, miRNAs and mRNAs in CRC as we know.
There're some evidences that the prognostic-related pseudogenes discovered in this study could regulate tumorigenesis and tumor development through ceRNA regulatory networks. For example, FER1L4 often acted as a tumor suppressor by interacting with some oncogenic miRNAs and inhibited the progression of multiple cancers such as glioma [19], prostate cancer [20], gastric cancer [21] and hepatocellular carcinoma [22]. And it could also inhibit proliferation, migration and invasion of colon cancer cells by associating with oncogene miR-106a-5p [22,23]. GVINP1 was downregulated in lung cancer and related with poor prognosis of patients with lung cancer [24,25]. NSUN5P2 was found unfavorable for the prognosis of hepatocellular carcinoma through bioinformatic analysis [26]. But for the remaining pseudogenes (DDX12P, NCF1C, PLEKHA8P1, RP9P, and YWHAZP4), there were no reports on their biological functions in cancer. In our study, we found YWHAZP4 was a protective factor for CRC patients while DDX12P, NCF1C, PLEKHA8P1 and RP9P were adverse factors for CRC patients. Thus, these ve pseudogenes might be novel prognostic biomarkers for CRC.
From the correlation analysis, we found the expression levels of some prognosis-related pseudogenes were signi cantly correlated with many prognosis-related protein-coding genes that had been reported to play roles in CRC and other kinds of cancers. For example, pseudogene DDX12P and FER1L4 were positively correlated with DNMT3B which was an important tumor associated gene and aberrantly expressed or mutant in many kinds of cancers [27,28]. Pseudogene GVINP1 and NCF1C highly correlated with the expression levels of many mRNAs such as PDGFRA, ENPP2, HAPLN1, SOCS6 and SNCG, they had all been reported to participate in the tumorigenesis and progression of CRC in many studies [29][30][31][32][33][34]. PLEKHA8P1 and RP9P seemed co-expressed with gene CCND1 and SNAI1, which also could contribute to CRC and other cancers progression [35,36]. The underlying regulatory mechanisms of these pseudogenes and functions in cancer development and the prognosis of CRC are worth to being further studied.
In conclusion, the prognostic-associated pseudogene ceRNA network provide a way to uncover the underlying regulatory functions and mechanisms of pseudogenes in CRC, and some novel potential biomarkers for the diagnosis of CRC and prognosis prediction for CRC patients were discovered through identi cation of the 5-pseudogene signature and clinical analysis.

Declarations
Funding This study was supported by the National Natural Science Foundation of China (Grant Numbers: 82073107 and 81773036), and Natural Science Foundation of Beijing (Grant Number: 7162040).

Con ict of Interest
The authors declare no con ict of interest.

Availability of data and material
The datasets used or analyzed during the current study are available from the corresponding author on reasonable request.

Ethics approval
Research protocols were approved by the Institutional Review Board of the Peking University Cancer Hospital and Institute. All patients in this study provided written informed consent.

Author Contributions
ZL: designed and initiated the study, analyzed the data, conducted qRT-PCR, and prepared data for manuscript. JZ and LG: collected the colon cancer tissues and prepared RNA and cDNA. BZ: codesigned the study, analyzed the data, and revised the manuscript. All authors read and approve the nal version of the manuscript.