A prognostic model constructed by CTHRC1 and LRFN4 in Stomach adenocarcinoma by Bioinformatics Analysis

Stomach adenocarcinoma (STAD) most common type of a considerable number of worldwide. This study specically aimed to identify potential the Gene expression proles microarray data were downloaded from the Gene Expression Omnibus (GEO) database. The ‘limma’ R package was used to screen the differentially expressed genes (DEGs) between STAD and matched normal tissues. The Database for Annotation, Visualization, and Integrated Discovery (DAVID) was used for function enrichment analyses of DEGs. The data of STAD cases with both RNA sequencing and clinical information of The Cancer Genome Atlas (TCGA) were obtained from Genomic Data Commons (GDC) data portal. Survival curves were analyzed by the Kaplan-Meier method, univariate Cox regression analysis and multivariate Cox regression were performed using ‘survival’ package. CIBERSORT algorithm used approach to characterize the 22 human immune cell composition. Gene expression proles microarray data and clinical information were downloaded from GEO database to validate prognostic model.

The present study constructed the prognostic model by expression of CTHRC1 and LRFN4 for the rst time via comprehensive bioinformatics analysis, which may provide clinical guidance and potential therapeutic targets for STAD.

Background
Stomach adenocarcinoma (STAD), the most common histological type (~ 95%) of malignancy originating in the stomach, imposes a considerable global health burden [1]. However, there are no sensitive and speci c diagnostic markers for early diagnosis of STAD. Although several drugs, such as trastuzumab, ramucirumab and immune checkpoint inhibitors, have been used for the treatment of STAD in the clinic, the survival rates of patients in advanced stages are low [2][3][4]. Therefore, identifying novel diagnostic biomarkers and developing therapeutic medicines for STAD are necessary.
Over the past decades, the development of high-throughput sequencing has produced large-scale biological data, and it has been an effective tool for discovering promising biomarkers for cancer [5].
Many potential therapeutic targets, including AFP, EGFR and HER2, have been explored by bioinformatic analysis [6][7][8]. Bioinformatic analysis has also played an important role in the potential biomarkers discovery for the diagnosis and treatment of stomach-related cancer: COL4A1 is a therapeutic target for recurrent gastric carcinoma; FN1, SERPINE1, and SPARC signi cantly predict poor prognosis of STAD; and nine hub genes have been identi ed to be strongly correlated with the pathogenesis of gastric cancer [9][10][11]. Despite the partial success of the above studies, there were two weaknesses. First, most of the studies focused on gene expressions and clinical data analyses, while the molecular mechanism of STAD was not extensively elucidated. Second, several biomarkers were revealed. Moreover, the approaches for treating STAD were not provided, which could possibly make clinical treatment di cult.
In the present study, we analyzed the mRNA expression pro les of STAD from Gene Expression Omnibus (GEO) database and obtained the differentially expressed genes (DEGs). Then, gene ontology (GO) analyses were further conducted to analyze the main biological functions modulated by the DEGs. By survival information was analyzed in patients with STAD from The Cancer Genome Atlas (TCGA), the candidate genes closely related to survival rate of patients were identi ed. Cox analysis was used to further screening mRNAs correlated with survival rates and constructed prognostic risk assessment model. In order to nd the cause of the differences in patient survival, we gourped the immune cell in ltration in the high-and low-risk groups, and analyzed the relationship between the mRNAs and immune cell in ltration. Finally, we used another independent data set to verify the prognostic effect of our model. Through this research, we aimed to identify high-quality biomarkers for STAD and provide reasonable results for further elucidation of the molecular mechanism of STAD.

Materials And Methods
Data collection and screening of DEGs. Gene expression pro les microarray data (GSE118916 of 15 STAD and 15 healthy tissues, GSE13861 of 65 STAD and 19 healthy tissues, GSE103236 of 10 STAD and 9 healthy tissues) were downloaded from the Gene Expression Omnibus (GEO) database (https://www.ncbi.nlm.nih.gov/geo/). The 'limma' R package was used to screen the DEGs between STAD and matched normal tissues [12]. Adjusted p-value < 0.05 and |log2 fold change (FC)| > 1 were set as the thresholds for DEG identi cation.
GO enrichment analyses. The Database for Annotation, Visualization, and Integrated Discovery (DAVID, https://david.ncifcrf.gov/) was used for biological process (BP), cellular component (CC) and molecular function (MF) enrichment analyses of DEGs [13]. P-value < 0.05 was used to screen statistically signi cant terms.
TCGA data analysis and survival analysis of DEGs. The data of STAD cases with both RNA sequencing and clinical information of TCGA were obtained from Genomic Data Commons (GDC) data portal (https://portal.gdc.cancer.gov/). We deleted cases with missing clinical information and therefore retained 370 cases. The mRNA high-level and low-level grouping was based on the median expression value of the mRNA, Survival curves were analyzed by the Kaplan-Meier method [14]. P-value < 0.05 was considered statistically signi cant.
Grouping of Samples and Construction of Prognostic Model. We used R 3.6.2 with 'survival' package to univariate Cox regression analysis and multivariate Cox regression. In order to reduce the number of mRNAs with similar expressions, mRNAs with p-value < 0.05 of univariate Cox regression were subjected to a stepwise multivariate Cox regression to construct the prognostic model. This model was used to evaluate the survival prognosis of patients in TCGA-STAD datasets using Kaplan-Meier curve, and logrank test according to median value grouping of risk score.
Inference of in ltrating cells in the TME. We used the CIBERSORT algorithm and the LM22 gene signature, which is a widely used approach to characterize the 22 human immune cell composition, including B cells, T cells, nature killer cells and macrophages [15]. After uploading the gene expression data with standard annotation on to the CIBERSORT web portal (http://cibersort.stanford.edu/), the algorithm ran under LM22 signature and 1,000 permutations.
Validation and Evaluation of Prognostic Model. Gene expression pro les microarray data (GSE84437 of 443 gastric cancer cases) were downloaded from GEO database to validate our prognostic model. The prognostic model was used to evaluate the survival prognosis of each patients in GSE84437 datasets using Kaplan-Meier curve, and log-rank test according to median value grouping of risk scores.
Statistical analysis. The present study used Mann-Whitney U tests (also called the Wilcoxon rank-sum test) statistically analyzing the gene expression and immune in ltration of two groups. A threshold of p < 0.05 was considered statistically signi cant. The gene expression correlation and correlation between immune in ltration and gene expression was accessed by Pearson's R and statistical signi cance. The absolute value of R greater than 0.1 was considered to be relevant and p-value < 0.05 was considered statistically signi cant. Gene expression data were processed by plus 1 and then take the log2 value.

Results
Identi cation of DEGs. A total of 3,860 DEGs were screened from the GSE118916 dataset. In addition, 550 and 463 DEGs were selected from the GSE13861 and GSE103236 datasets, respectively. Volcano plots were plotted to present the distribution of DEGs between OSCC and normal samples in each dataset ( Fig. 1A-C). After the combination, a total of 44 DEGs with adjusted p-value < 0.05 and |log2 fold change (FC)| > 1 were screened out at the intersection of the three datasets.
Functional and pathway enrichment analyses of the DEGs. To reveal the biological functions of the DEGs, GO enrichment analysis was conducted with DAVID. Regarding molecular function, the GO analysis results showed that the DEGs were mainly enriched in terms related to extracellular matrix binding and cytokine activity. These DEGs were involved in cell adhesion, wound healing and extracellular matrix organization biological processes. For cellular components, the DEGs were enriched in extracellular regions, including the proteinaceous extracellular matrix, extracellular region and extracellular space (Fig. 2).
Survival analysis of DEGs.
To further evaluate the prognostic value of the 45 DEGs, the clinical data of patients with STAD were downloaded from the TCGA database. The overall survival of patients with STAD based on the high and low expression of DEGs, was then obtained using Kaplan-Meier plotters. The results indicated that the high expression of DPT, COL5A2, CTHRC1 and low expression of ECT2, LRFN4 was associated with poor overall survival in patients with STAD (Fig. 3). In short, we found 5 genes that were signi cantly related to the prognosis of patients with STAD.
Construction and Evaluation of Prognostic Model.
The univariate Cox regression analysis displayed that of 5 mRNAs were found to be associated with overall survival in TCGA-STAD dataset (N = 370). Three mRNAs of p-value < 0.05 were selected for further analysis (Table 1). Two (LRFN4, CTHRC1) of three mRNAs screend out by stepwise multivariate Cox regression analysis (Fig. 4A). Then, the prognostic model was constructed by expression of LRFN4 and CTHRC1 and its coe cient in multivariate Cox regression as follows: risk score = (-0.20788 × expression of LRFN4) + (0.18741 × expression of CTHRC1). According to the median value of risk scores, patients were divided into the high-risk group and the low-risk group, while high-risk group has worse prognosis (Fig. 4B). Delete data with missing age, sex and tumor stage information and keep 344 samples for the next analysis. Multivariate Cox regression analyses also revealed that the risk score was an independent predictor of survival in TCGA datasets, after adjusting for age (< 60 vs. ≥60), sex (male vs. female), and tumor stage (I and II vs. III and IV) (Fig. 5).
The difference of tumor in ltrating immune cell composition between the high-risk group and the low-risk group was analyzed by CIBERSORT. The in ltration level of eight immune cells was signi cant between the two groups, including CD4 memory resting T cells, M2 macrophages, memory B cells, resting dendritic cells, eosinophils, gamma delta T cells (Fig. 6). We also explored the relationships between these six immune cells and the expression of two genes include LRFN4 and CTHRC1. The result showed that in ltration levels of CD4 memory resting T cells, memory B cells, eosinophils and gamma delta T cells were signi cantly correlated with expression levels of LRFN4 and in ltration level of CD4 memory resting T cells, memory B cells and M2 macrophages were signi cantly related with expression of CTHRC1 (Fig. 7).

Evaluation of Prognostic Model for Over Survival in GEO dataset
We evaluated our model by GSE84437 dataset download from GEO database (N = 443). The expression of CTHRC1 was higher in the high-risk group and the expression of LRFN4 was higher in the low-risk group in TCGA-STAD dataset (Fig. 8A, B). The expression of CTHRC1 and LRFN4 showed the same expression pattern in the GSE84437 dataset (Fig. 9A, B). In addition, the risk score of each patient was calculate by the prognostic model proposed above and divide the patients of GSE84437 dataset into high-risk group and low-risk group based on the median risk score. Survival analysis results showed that patients had a worse prognosis in the high-risk group (Fig. 9C).

Discussion
In the present study, 44 DEGs were identi ed between STAD and healthy samples from three datasets. To better clarify the functions of DEGs, we further performed functional enrichment analysis. The proteins translated by DEGs were mainly located in extracellular regions and these genes were primarily implicated in tumor-related biological processes such as cell adhesion, wound healing and extracellular matrix organization [16][17][18].
Then, seven candidate genes (DPT, ECT2, COL5A2, CTHRC1, and LRFN4) that were closely related to the survival rate of STAD patients were identi ed by analyzing the total survival information from STAD patients in TCGA program. Based on the results of univariate Cox regression and stepwise multivariate Cox regression in TCGA-STAD dataset, CTHRC1 and LRFN4 were eventually used to constructed risk score and prognostic model. Kaplan-Meier curves showed that the high-risk group had an obviously poorer overall survival compared to the low-risk group in the three groups. In addition, multivariate Cox regression analyses indicated that the risk score was an independent predictor of survival in TCGA datasets, after adjusting for age, sex, and tumor stage.
CTHRC1 encodes a protein that may play a role in the cellular response to arterial injury through involvement in vascular remodeling [19]. Previous studies had shown that CTHRC1 promoted tumor cell progression and might play a key role in the invasion and metastasis of cervical carcinoma, cervical squamous cell carcinoma and colorectal cancer [20][21][22]. In addition, CTHRC1 promotes M2-like macrophage recruitment and myometrial invasion in endometrial carcinoma by integrin-Akt signaling pathway [23], which indicated that CTHRC1 might be a biomarker for tumor immunotherapy.
LRFN4 encodes leucine-rich repeat and bronectin type III domain-containing 4, belongs to the superfamily of leucine-rich repeat-containing adhesion molecules [24]. In a previous study, LRFN4 was highly expressed in tumor cells, which is consistent with our analysis results (FIG S1) [25]. However, this study showed that high expression of LRFN4 is associated with poor prognosis, which contradicts our ndings (Fig. 3C). Due to the lack of researches on the relationship of LRFN4 and cancer, the correlations between LRFN4 and cancer need to be veri ed.
Next, immune cell in ltration analysis showed that eight immune cells have difference in the high-risk group and the low-risk group. Some studies have revealed the important role of these cells in cancer. For example, M2 macrophages were suppressors of anti-tumor immunity and gamma delta T cells were signi cantly correlated with an increased risk of death in pancreatic ductal adenocarcinoma [26,27]. Moreover, in ltration level of three immune cells including CD4 memory resting T cells, memory B cells and M2 macrophages were signi cantly correlated with expression of CTHRC1 and four immune cells contain CD4 memory resting T cells, memory B cells, eosinophils and gamma delta T cells were signi cantly related with expression of LRFN4. These results indicated that the expression of CTHRC1 and LRFN4 may affect tumor growth, metastasis and invasion through these immune cells, further affecting the prognosis of patients.
In the last, we downloaded a data set (GSE84437) including 443 patients' clinical and gene expression information from GEO database. Based on the median risk score derived from the prognostic model, 443 patients were divided into the high-risk group and the low-risk group. Survival analysis showed that survival time of patients in the high-risk group was signi cantly lower than that in the low-risk group. And in both the prognostic model and the verifying model, CTHRC1 was higher expressed in the high-risk group, and LRFN4 was lower expressed in the high-risk group. These results indicated that our model performed well in different data sets and had good clinical application prospects.
Despite all of these studies, our research still included some limitations: (a) The prognostic model is based on the expression of genes in the tissue and based on the expression of genes and proteins in the blood have more convenience in clinical application; (b) The prognostic effect of CTHRC1 and LRFN4 had been manifested, but the speci c role played in STAD still needs further clari cation. Therefore, these will also be the focus of our next study.
In conclusion, we constructed a new predictive model of mRNA prognosis through mRNA expression pro ling, the model was veri ed and evaluated by another independent dataset. In addition, we analyzed the possible causes of the difference in prognosis from the perspective of immune microenvironment. However, this is just a study based on the online database using bioinformatics. We hope that there will be numerous of experiments to verify the feasibility of this prognostic model in the future and provide a reliable predictor and therapeutic target for STAD patients.

Declarations
Ethics approval and consent to participate Not applicable.

Consent for publication
Not applicable.

Availability of data and materials
The datasets used and/or analyzed during the present study are available from the corresponding author on reasonable request.      Multivariate Cox regression analysis of clinicopathologic factors and risk score for STAD in TCGA sets.

Figure 6
Multivariate Cox regression analysis of clinicopathologic factors and risk score for STAD in TCGA sets.