CA9 is an important molecular marker for the prognosis of osteosarcoma patients

As a common malignant bone tumor, osteosarcoma (OS) progresses rapidly and recurs easily. This study is aimed to build a risk score system for OS patients. From The Cancer Genome Atlas and Gene Expression Omnibus databases, the RNA-seq data of OS (the training set) and GSE39055 (the validation sets) separately were obtained. Combined with limma package, the differentially expressed lncRNAs (DE-lncRNAs) and mRNAs (DE-mRNAs) between recurrence and non-recurrence groups were analyzed. After the RNAs correlated with independent prognosis were screened using survival package, risk score systems were constructed and the optimal one was selected. For the independent clinical prognostic factors identied by survival package, stratication analysis was conducted. Using Gene Set Enrichment Analysis, pathways were enriched for the differentially expressed genes (DEGs) between high and low risk groups. For and and 10 and and eight were found to be correlated with independent From the four risk systems, the expression system Among the metabolism were enriched for the DEGs between high and low risk groups.

samples) with survival prognosis information; platform: Illumina HumanHT-12 WG-DASL V4.0 R2 expression beadchip) was obtained as the validation set. The raw expression data of the two datasets were downloaded, and then the expression data was normalized using the unit-scale normalization algorithm [15].

Differential expression analysis
Combined with the reference sequence (RefSeq) identi cation numbers (IDs) of the downloaded data, the mRNAs and lncRNAs shared by the two datasets were identi ed based on the annotation information in HUGO Gene Nomenclature Committee (HGNC, http://www.genenames.org/) database [16].
Based on the information of recurrence or not, the OS samples in the training set were divided into recurrence and non-recurrence groups. Using the R package limma (version 3.34.7, https://bioconductor.org/packages/release/bioc/html/limma.html) [17], the differentially expressed RNAs (DERs, including both differentially expressed lncRNAs (DE-lncRNAs) and differentially expressed mRNAs (DE-mRNAs)) between recurrence and non-recurrence groups were screened. The |log 2 fold change (FC)| > 0.263 and false discovery rate (FDR) < 0.05 were selected as the cut-off criteria. Based on the centered Pearson correlation algorithm in the R package pheatmap (version 1.0.8, https://cran.r-project.org/web/packages/pheatmap/index.html) [18], bidirectional hierarchical clustering for the expression values of the DERs were performed.

Construction of risk score systems
Using the univariate Cox regression analysis in the survival package (version 2.41-1, http://bioconductor.org/packages/survivalr/) [19] in R, the DERs signi cantly correlated with overall survival were identi ed combined with the survival prognosis information of the OS samples in the training set.
Subsequently, the DERs associated with independent prognosis were further selected using the multivariate Cox regression analysis in the survival package [19]. The log-rank p-value < 0.05 was the signi cant threshold.
After the prognostic coe cients of the independent prognosis-associated DE-lncRNAs and DE-mRNAs were obtained, the risk score systems were constructed.
With the Monte-Carlo p-value < 0.05 as the criterion, the cutoff values of the expression levels of the independent prognosis-associated DE-lncRNAs and DE-mRNAs were selected using X-Tile Bio-Informatics Tool (https://medicine.yale.edu/lab/rimm/research/software.aspx) [20]. The status of each sample in the expression of the RNA was de ned based on the cutoff value of each RNA. When the expression level of the RNA was higher than its cutoff value, status was de ned as 1. When the expression level of the RNA was lower than its cutoff value, status was de ned as 0. Then, the risk score system was constructed by linear combination of RNA expression status weighted by regression coe cients, and the corresponding formula for calculating risk score (RS) was: Status Risk Score = ∑β RNAn × Status RNAn β and status separately represented the regression coe cient and the status variable (both the mRNA expression status-based risk score system and the lncRNA expression status-based risk score system were built here).
Besides, the expression level-based risk score systems were constructed based on the following calculation formula: Expression Risk score = ∑β RNAn × Exp RNAn β and "Exp RNAs " stood for the regression coe cient and the expression level of RNA. (both the mRNA expression level-based risk score system and the lncRNA expression level-based risk score system were constructed).
The RSs of the OS samples in the training set were calculated, and their median was used as the demarcation point for dividing the samples into high risk (with RS no less than the demarcation point) and low risk (with RS lower than the demarcation point) groups. Using the Kaplan-Meier (KM) method in survival package [19], the correlation between risk grouping and actual survival prognosis information was evaluated. Meanwhile, the samples in the validation set were classi ed into high and low risk groups according to the median of their RSs. Afterwards, C-index and Brier score calculated by survcomp package (version 1.30.0, http://www.bioconductor.org/packages/release/bioc/html/survcomp.html) [21] and log-rank p-value calculated by survival package [19] were used to assess the correlation between risk grouping and actual survival prognosis information in the validation set.

Strati cation analysis
Using the univariate Cox regression analysis and multivariate Cox regression analysis in the survival package [19], the independent clinical prognostic factors were screened from the training set. The log-rank p-value < 0.05 was selected as the threshold of signi cant correlation.
To further investigate the relationship between the independent clinical prognostic factors in different risk groups and prognosis, strati cation analysis [22] for the independent clinical prognostic factors was carried out. Risk strati cation was performed for the OS samples in the training set to study the prognosis conditions of target clinical factors in different risk groups.

Pathway enrichment analysis
The RSs of the OS samples in the training set were calculated based on the optimal risk score system, and then the OS samples were divided into high and low risk groups. With the |log 2 FC| > 0.263 and FDR < 0.05 as the thresholds, the differentially expressed genes (DEGs) between the two groups were selected using limma package [17]. In addition, the DEGs were conducted with pathway enrichment analysis using Gene Set Enrichment Analysis (GSEA, http://software.broadinstitute.org/gsea/index.jsp) [23]. The FDR < 0.05 was set as the threshold for selecting signi cant pathways.

Differential expression analysis
Combined with HGNC database, a total of 15672 mRNAs and 993 lncRNAs shared by the two datasets were annotated. The 169 OS samples in the training set were divided into recurrence group (28 samples) and non-recurrence group (141 samples). There were 333 DERs between recurrence and non-recurrence groups, including 319 DE-mRNAs (120 up-regulated and 199 down-regulated) and 14 DE-lncRNAs (two up-regulated and 12 down-regulated). As shown in the hierarchical clustering heatmap, the expression values of the DERs in the samples could clearly classify the samples into two clusters (Fig. 1).
The cutoff value of the expression level of each independent prognosis-associated DE-lncRNA and DE-mRNA was obtained, based on which the status of each sample in the expression of the RNA was de ned. Then, the expression status-based risk score systems were constructed combined with the prognostic coe cients and expression status of the independent prognosis-associated DE-lncRNAs and DE-mRNAs, and the corresponding formula was: According to the median of the RSs, the OS samples in the training set and the validation set separately were divided into two groups (high risk group and low risk group). Based on KM method, correlation analysis for the risk grouping and actual survival prognosis information was performed. For both the training set and the validation set, the risk groups divided by the lncRNA expression status-based risk score system (training set: log-rank p-value = 2.196e-03, C-index = 0.703, Brier score = 0.0652; validation set: log-rank p-value = 3.537e-03, C-index = 0.779, Brier score = 0.0897) and mRNA expression status-based risk score system (training set: log-rank p-value = 6.264e-11, C-index = 0.878, Brier score = 0.0402; validation set: log-rank p-value = 8.905e-03, C-index = 0.822, Brier score = 0.0647) had signi cant correlations with the actual survival prognosis information (Fig. 2). Meanwhile, the risk groups divided by the lncRNA expression level-based risk score system (training set: log-rank p-value = 5.071e-04, C-index = 0.730, Brier score = 0.0603; validation set: log-rank p-value = 6.793e-03, Cindex = 0.887, Brier score = 0.101) and mRNA expression level-based risk score system (training set: log-rank p-value = 1.927e-08, C-index = 0.869, Brier score = 0.0397; validation set: log-rank p-value = 1.132e-01, C-index = 0.733, Brier score = 0.147) were signi cantly correlated with the actual prognosis information (Fig. 3). Through comparing the log-rank p-values, C-indexes, and Brier scores, the mRNA expression status-based risk score system was selected as the optimal risk score system.

Strati cation analysis
Combined with Cox regression analysis, three independent clinical prognostic factors (age, recurrence, and mRNA status based RS model) in the training set were identi ed ( Table 2). KM curves showed that OS patients with lower age and non-recurrence had better prognosis (Fig. 4), which was consistent with the actual situation. Strati cation analysis indicated that age and recurrence were signi cantly correlated with overall survival prognosis in the high risk group (Fig. 5).

Pathway enrichment analysis
Combined with the mRNA expression status-based risk score system, the OS samples in the training set were divided into high and low risk groups according to their RSs. There were 722 DEGs (144 up-regulated genes and 578 down-regulated genes) in high risk group in comparison to low risk group (Fig. 6). Moreover, seven pathways for the DEGs were enriched, including vascular smooth muscle contraction (FDR = 0.0000), histidine metabolism (FDR = 0.0217), and glycine, serine and threonine metabolism (FDR = 0.0436) ( Table 3).

Discussion
There were 319 DE-mRNAs (120 up-regulated and 199 down-regulated) and 14 DE-lncRNAs (two up-regulated and 12 down-regulated) between recurrence and non-recurrence groups. Through performing Cox regression analysis, 10 DE-mRNAs (including ALDH1A1, CA9, GMDS, LCMT2, LRRC75A, METTL1, RAB29, TADA2B, TDRD7, and TIGD2) and eight DE-lncRNAs correlated with independent prognosis were identi ed. After four risk score systems were built, the mRNA expression status-based risk score system was selected as the optimal one. Three independent clinical prognostic factors (age, recurrence, and mRNA status based RS model) were selected, among which age and recurrence were signi cantly correlated with overall survival prognosis in the high risk group. Previous study has revealed that CA9 is overexpressed in OS compared with bone marrow stromal cells (BMSCs) and CA9 inhibitor 3 can reduce tumor growth via causing signi cant necrosis [24]. CA9 expression is detected in OS patients, and its inhibitors is able to inhibit cell proliferation, cell migration, and chemoresistance in OS under hypoxic conditions [25]. Therefore, CA9 might act in the prognosis of OS patients.
Although ALDH1A1, GMDS, LCMT2, LRRC75A, METTL1, TADA2B, and TDRD7 have not been reported to be directly correlated with OS, they have found to be able to affect other tumors. Through upregulating the cancer stem cell marker ALDH1A1, CCAAT-Enhancer-Binding Protein β1 (C/EBPβ1) accelerates transformation and results in chemoresistance in Ewing sarcoma [26]. GMDS de ciency promotes the resistance of colon cancer cells to the apoptosis caused by tumor necrosis factor-related apoptosis-inducing ligand (TRAIL), leading to tumor progression and metastasis through evading tumor immune surveillance [27,28]. Some cancer-related genes (including LCMT2) possess frameshift mutations and mutational intratumoral heterogeneity in colorectal cancer (CRC) with high microsatellite instability, and their frameshift mutations may be involved in the tumorigenesis of the disease [29]. Decreased LRRC75A antisense RNA 1 (LRRC75A-AS1) plays an anti-tumor role in the tumorigenesis and progression of CRC, which may serve as a diagnostic marker in CRC [30]. Upregulated METTL1 is related to unfavorable outcomes in hepatocellular carcinoma, and METTL1 contributes to cell proliferation and migration in the tumor by inhibiting PTEN signaling [31]. Several highest-ranking genes (including TADA2B) are selected by RNAi Gene Enrichment Ranking algorithm, and their loss contributes to vemurafenib resistance in melanoma [32]. TDRD7, heat shock protein 90B (HSP90B), RAB1A, and vimentin are found to be pseudopodia-localizing proteins, which play roles in cell migration and affect tumor malignancy [33]. These suggested that ALDH1A1, GMDS, LCMT2, LRRC75A, METTL1, TADA2B, and TDRD7 might also be implicated in the prognosis of OS patients.
Additionally, 722 DEGs (144 up-regulated genes and 578 down-regulated genes) between high risk and low risk groups were screened, for which seven pathways (including vascular smooth muscle contraction, and glycine, serine and threonine metabolism) were enriched. Structural damage and contractile dysfunction of vascular smooth muscle and anti-cancer effects may be induced by the ginsenoside Rg3 via destroying the actin [34]. Smooth muscle contraction, differentiation, and migration, as well as cancers and cardiovascular diseases are affected by the actin-associated protein palladin [35]. The mTOR complex 1 (mTORC1)/serine/glycine metabolic axis contributes to antioxidant ability, cell proliferation, and cell survival in OS, indicating that mTORC1mediated serine/glycine metabolism exerts a protective effect on OS cells [36,37]. Thus, vascular smooth muscle contraction, and glycine, serine and threonine metabolism might be related to the prognosis of OS patients.
After differential expression analysis for recurrence and non-recurrence groups was conducted, the DE-mRNAs and DE-lncRNAs correlated with independent prognosis were selected. Moreover, risk score systems were constructed based on the prognosis-associated RNAs, followed by the independent clinical prognostic factors correlated with prognosis were revealed by strati cation analysis. Nevertheless, these ndings were obtained on the basis of bioinformatics analyses and needed to be con rmed by adequate experiments.

Conclusion
In conclusion, 319 DE-mRNAs and 14 DE-lncRNAs were identi ed from recurrence and non-recurrence groups. Besides, the mRNA expression status-based risk score system (involving ALDH1A1, CA9, GMDS, LCMT2, LRRC75A, METTL1, RAB29, TADA2B, TDRD7, and TIGD2) might be applied for predicting the prognosis of OS patients. Furthermore, vascular smooth muscle contraction, and glycine, serine and threonine metabolism might have correlations with the prognosis of OS. Availability of data and material The raw data were collected and analyzed by the Authors, and are not ready to share their data because the data have not been published.
Competing interests:   age; right: recurrence); (B) The KM curves for high risk group (left: age; right: recurrence). In the KM curves for age, blue and red curves separately represent the samples no older than 60 years and the samples aged over 60 years. In the KM curves for recurrence, blue and red curves represent the samples in nonrecurrence and recurrence groups, respectively.