A 14 transcription factors-associated nomogram predicts the recurrence-free survival of gastric cancer

Background: We aimed to construct and validate a novel transcription factor (TF) signature for predicting recurrence-free survival (RFS) in patients with gastric cancer (GC), using data from the TCGA and Gene Expression Omnibus (GEO) databases, and thereby to improve RFS prediction in GC patients.
Methods: We searched the TCGA and GEO databases to obtain gene expression data and related clinical information for GC. In total, 722 TFs and 384 GC patients with complete clinical information were identified to develop the TF signature. All TFs were entered into a univariate Cox regression model. TFs with P < 0.05 in the univariate model were then entered into a least absolute shrinkage and selection operator (LASSO) Cox regression model to identify candidate TFs related to RFS. After further adjustment, multivariate Cox regression was performed on the candidate TFs to identify the TF signature for evaluating RFS in GC patients.
Results: Receiver operating characteristic (ROC) analysis confirmed the high ability of the 14-TF panel to predict RFS in GC patients. The areas under the curve (AUCs) at 1, 3 and 5 years in the internal validation dataset were 0.827, 0.817 and 0.811, respectively. Similar results were obtained in the external validation dataset (0.808, 0.907 and 0.813, respectively) and in the entire dataset (0.815, 0.849 and 0.801, respectively). The model also discriminated well between the high-risk and low-risk groups. Furthermore, a nomogram was developed from the risk score, sex, cancer status and tumor grade; the C-index, ROC curves, calibration plots and decision curve analysis (DCA) demonstrated good predictive ability and clinical applicability of the nomogram.
Conclusions: We established and validated a novel 14-TF-associated nomogram for predicting the RFS of GC.

Effective signatures for predicting the prognosis of GC could improve individualized clinical management.
Numerous studies have reported that TFs play a significant role in the progression and prognosis of cancer. For instance, Edwards et al. showed that the TF ZEB1 was prognostic and predictive in diffuse gliomas [7]. Oktay et al. indicated that thyroid TF-1 played a key role in the prognosis of lung adenocarcinoma (LUAD) [8]. Su et al. reported that TF 7 was a marker of poor prognosis in glioblastoma multiforme, enhancing proliferation by upregulating c-myc [9]. Lee et al. suggested that combined immunohistochemistry for SMAD4 and runt-related TF 3 may identify a favorable prognostic subgroup of pancreatic ductal adenocarcinomas [10]. Fan et al. reported that microRNA-301a-3p overexpression may contribute to cell invasion and proliferation by targeting runt-related TF 3 in prostate cancer [11]. Studies of TFs are therefore promising for identifying predictive biomarkers that could help clinicians offer personalized treatment and may improve patient survival. However, few studies have examined TFs as independent biomarkers for GC prognosis. A comprehensive and systematic approach to identifying TFs as independent and valuable predictors of GC prognosis is therefore needed.
In this study, we obtained gene expression data and clinical information for GC from the TCGA and GEO databases, and determined the corresponding TFs and eligible patients in order to investigate TF markers of GC prognosis. We identified a 14-TF signature for predicting the RFS of GC patients by integrated bioinformatic analysis. Kaplan-Meier and ROC analyses confirmed the high prognostic ability of the 14-TF signature in GC. In addition, we developed a predictive nomogram that integrated the 14-TF signature with conventional clinicopathological factors, and the results suggested that the nomogram had good predictive value.

Data source and processing
We searched the TCGA and GEO databases using the TCGAbiolinks package [12] and the GEOquery package [13], respectively, to obtain gene expression data and related clinical information for GC. In total, 24991 genes and 407 GC patients with complete clinical information in the TCGA database were included. Samples without prognostic data and non-TF genes were excluded from subsequent analysis.
TFs were determined based on the TRRUST database (https://www.grnpedia.org/trrust/downloadnetwork.php) [14]. Raw counts in the expression matrix were converted to transcripts per million (TPM). Genes with zero expression in more than 20% of the samples were excluded. Finally, 722 TFs and 384 patients with GC were retained for univariate Cox regression analyses. Similarly, the raw data of GSE26253 were preprocessed and normalized by the robust multichip averaging (RMA) method [15] using the affy package [16] in R (v3.6.1). In the end, 432 patients in GSE26253 were included as an external validation set. The LASSO method was used to identify candidate TFs for predicting the RFS of GC patients. The LASSO Cox regression model was run for 1000 iterations using a publicly available R package.
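The two TCGA preprocessing steps described above (TPM conversion and removal of sparsely expressed genes) can be illustrated with a minimal Python sketch. The study itself was carried out in R; the function names, the toy gene names and the use of gene lengths in kilobases as an input are assumptions made for this example, not details reported by the authors.

```python
def counts_to_tpm(counts, lengths_kb):
    """Convert one sample's raw counts to transcripts per million (TPM).

    counts: dict mapping gene -> raw read count
    lengths_kb: dict mapping gene -> gene length in kilobases (assumed input)
    """
    rpk = {g: counts[g] / lengths_kb[g] for g in counts}  # reads per kilobase
    scale = sum(rpk.values()) / 1e6                        # per-million factor
    return {g: (v / scale if scale > 0 else 0.0) for g, v in rpk.items()}


def filter_low_expression(matrix, max_zero_fraction=0.2):
    """Keep genes whose expression is 0 in at most max_zero_fraction of samples.

    matrix: dict mapping gene -> list of expression values, one per sample
    """
    kept = {}
    for gene, values in matrix.items():
        zero_fraction = sum(1 for v in values if v == 0) / len(values)
        if zero_fraction <= max_zero_fraction:
            kept[gene] = values
    return kept


# Toy values for illustration only; not data from the study.
tpm = counts_to_tpm({"GENE_A": 100, "GENE_B": 300}, {"GENE_A": 1.0, "GENE_B": 3.0})
filtered = filter_low_expression({"TF1": [0, 0, 1, 2, 3], "TF2": [1, 0, 2, 3, 4]})
```

In this toy example TF1 is zero in 40% of samples and is dropped, while TF2 is zero in exactly 20% and is kept, matching the "more than 20%" rule.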

Gene set enrichment analysis and protein-protein interaction (PPI) analysis
The TFs selected by univariate Cox regression analysis (P < 0.05) were subjected to Gene Ontology (GO) analysis and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis using the R package clusterProfiler [17]. GO analysis annotates genes in terms of cellular components, biological processes and molecular functions [18]. KEGG pathway analysis relates genes to molecular interaction, reaction and relation networks [19].
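Over-representation analysis of GO terms and KEGG pathways is commonly based on a hypergeometric test. The Python sketch below shows that calculation for a single term; it is an illustration of the general technique, not the clusterProfiler implementation, and the function name and toy numbers are assumptions.

```python
from math import comb


def hypergeom_enrichment_p(n_background, n_in_term, n_selected, n_overlap):
    """P(X >= n_overlap) when n_selected genes are drawn from n_background genes,
    n_in_term of which are annotated to the term of interest."""
    total = comb(n_background, n_selected)
    upper = min(n_in_term, n_selected)
    tail = sum(
        comb(n_in_term, k) * comb(n_background - n_in_term, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Toy numbers: 10 background genes, 5 in the term, 5 selected, all 5 overlap.
p_all_overlap = hypergeom_enrichment_p(10, 5, 5, 5)
p_no_overlap_required = hypergeom_enrichment_p(10, 5, 5, 0)
```

In practice the P values for all tested terms would then be adjusted for multiple testing before a significance threshold is applied.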
A PPI network was built using the Search Tool for the Retrieval of Interacting Genes (STRING, http://string-db.org) and visualized with Cytoscape (ver. 3.5.1) [20]. Only experimentally confirmed interactions with a combined score >0.4 were considered significant. The Molecular Complex Detection (MCODE) plug-in was used to identify the key module in the PPI network, with the following criteria: MCODE score >3 and number of nodes ≥4.

Generation of TF-associated signature
The association between TF expression and patients' RFS was evaluated by univariate Cox regression analysis to select TFs relevant to RFS. The selected TFs were then subjected to LASSO analysis to identify candidate TFs reliably associated with RFS. Multivariate Cox regression was subsequently performed on the candidate TFs to derive the predictive TF signature for RFS in GC patients. The 384 patients were randomly assigned to a training set (n = 269) and an internal validation set (n = 115).
The training set was used to identify the prognostic TF signature. The internal validation set, the external validation set and the entire TCGA dataset were used to validate the results.
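The random assignment of the 384 TCGA patients to the training and internal validation sets can be sketched as follows. The seed, the function name and the patient identifiers are hypothetical; the source does not report how the random split was generated.

```python
import random


def split_patients(patient_ids, n_train, seed=0):
    """Shuffle patient IDs reproducibly and split them into two disjoint sets."""
    shuffled = list(patient_ids)
    random.Random(seed).shuffle(shuffled)  # fixed seed is an assumption
    return shuffled[:n_train], shuffled[n_train:]


# Hypothetical patient identifiers standing in for the 384 TCGA patients.
ids = [f"P{i:03d}" for i in range(384)]
train_ids, internal_ids = split_patients(ids, n_train=269)
```

Fixing the seed makes the partition reproducible, which matters because the signature is fitted on the training set only.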
A TF risk score formula was then established to determine the RFS risk of every patient, using the coefficients from the multivariate Cox regression analysis. Patients with GC in each set were stratified into a high-risk or low-risk group, with the median risk score of that set as the cutoff. Survival differences between the high-risk and low-risk groups in each set were estimated using the Kaplan-Meier method and compared with the log-rank test. We conducted ROC analysis to assess the sensitivity and specificity of survival prediction based on the TF risk score; the larger the AUC, the better the model's predictive performance. Stratified analysis was then performed according to clinical parameters in the entire TCGA set. All ROC and Kaplan-Meier curves were drawn with R (version 3.6.1).
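A risk score of this kind is usually a linear combination of expression values weighted by the multivariate Cox coefficients. The sketch below assumes that form and shows the median split. The TF names, coefficients and expression values are placeholders, not the fitted 14-TF model, and treating scores strictly above the median as high risk is an assumption.

```python
from statistics import median


def risk_score(expression, coefficients):
    """Linear predictor: sum of coefficient x expression over the signature TFs."""
    return sum(coefficients[tf] * expression[tf] for tf in coefficients)


def stratify_by_median(scores):
    """Label each patient 'high' or 'low' relative to the median score of the set."""
    cutoff = median(scores.values())
    # Assumption: scores strictly above the median count as high risk.
    groups = {pid: ("high" if s > cutoff else "low") for pid, s in scores.items()}
    return groups, cutoff


# Placeholder coefficients and expression values; NOT the fitted 14-TF model.
coefficients = {"TF_A": 0.5, "TF_B": -0.25}
patients = {
    "P1": {"TF_A": 2.0, "TF_B": 4.0},
    "P2": {"TF_A": 4.0, "TF_B": 0.0},
    "P3": {"TF_A": 0.0, "TF_B": 4.0},
    "P4": {"TF_A": 8.0, "TF_B": 4.0},
}
scores = {pid: risk_score(expr, coefficients) for pid, expr in patients.items()}
groups, cutoff = stratify_by_median(scores)
```

Because the cutoff is recomputed in each set, the high-risk and low-risk groups are of roughly equal size in every cohort.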

Gene set variation analysis (GSVA)
To explore the signaling pathways related to the 14-TF signature, single-sample gene set enrichment analysis (ssGSEA) was carried out on the TCGA GC mRNA dataset with the GSVA package [21]. The top 20 pathways positively related to the risk score were investigated. Patients were assigned to high- or low-risk cohorts using the median risk score as the cutoff. An adjusted P < 0.05 was considered significant.

Construction of the nomogram
Univariate and multivariate Cox proportional hazards analyses were performed on the risk score and other clinicopathological factors. Factors with P < 0.05 in the multivariate analysis were used to establish a nomogram with the 'rms' R package. Hazard ratios (HRs) and the corresponding 95% confidence intervals (CIs) were estimated with Cox proportional hazards models. The prognostic value of the nomogram was assessed by the C-index, ROC analysis, calibration plots and DCA. The performance of the nomogram was displayed in the calibration curve, in which the 45-degree line represents ideal performance.
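The C-index used to assess the nomogram can be illustrated with a plain Python version of Harrell's concordance index. This is a sketch of the standard definition, not the 'rms' implementation; the function name and the toy data are assumptions.

```python
def concordance_index(times, events, risks):
    """Harrell's C: fraction of comparable pairs in which the patient who
    recurred earlier also has the higher predicted risk (ties count 0.5)."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when i had an event strictly before j's time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data: times in months, 1 = recurrence, 0 = censored; not study data.
times = [1, 2, 3, 4]
events = [1, 1, 0, 1]
c_perfect = concordance_index(times, events, [4.0, 3.0, 2.0, 1.0])
c_reversed = concordance_index(times, events, [1.0, 2.0, 3.0, 4.0])
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ranking of recurrence times by predicted risk.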

Clinical characteristics of the study populations
The study was performed on 384 patients who were clinically and pathologically diagnosed with GC.

Relationship between the 14-TF signature and GC patients' RFS in internal validation dataset and external validation dataset as well as the whole dataset
Kaplan-Meier analysis was applied to compare RFS between the high-risk and low-risk groups. RFS was shorter for high-risk GC patients than for low-risk GC patients in the internal validation set (P = 3e-10) (Figure 5A). A similar outcome was observed in the external validation dataset (P = 6e-05) (Figure 5C) and in the entire dataset (P = 1e-13) (Figure 5E).
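For readers unfamiliar with the two methods behind these comparisons, the following Python sketch implements a Kaplan-Meier estimator and a two-group log-rank test from their textbook definitions. It is not the R code used in the study; the function names and toy follow-up data are assumptions.

```python
import math


def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time."""
    curve = []
    survival = 1.0
    event_times = sorted({t for t, e in zip(times, events) if e})
    for t in event_times:
        at_risk = sum(1 for x in times if x >= t)
        recurrences = sum(1 for x, e in zip(times, events) if x == t and e)
        survival *= 1.0 - recurrences / at_risk
        curve.append((t, survival))
    return curve


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, P value)."""
    pooled = list(zip(times_a, events_a)) + list(zip(times_b, events_b))
    event_times = sorted({t for t, e in pooled if e})
    diff = 0.0      # observed minus expected events in group A
    variance = 0.0  # hypergeometric variance summed over event times
    for t in event_times:
        n_a = sum(1 for x in times_a if x >= t)
        n_b = sum(1 for x in times_b if x >= t)
        d_a = sum(1 for x, e in zip(times_a, events_a) if x == t and e)
        d_b = sum(1 for x, e in zip(times_b, events_b) if x == t and e)
        n, d = n_a + n_b, d_a + d_b
        diff += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if variance == 0:
        raise ValueError("log-rank statistic is undefined for these data")
    chi2 = diff ** 2 / variance
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Toy follow-up times in months (1 = recurrence, 0 = censored); not study data.
km_curve = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])
chi2, p_value = logrank_test([1, 2, 3], [1, 1, 1], [4, 5, 6], [1, 1, 1])
```

Here the first group plays the role of the high-risk cohort: all of its recurrences precede those of the second group, so the test rejects equality of the two curves at the 0.05 level.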

Evaluation of the predictive performance of the 14-TF signature by using ROC analysis
Time-dependent ROC curves were drawn to assess the predictive power of the 14-TF signature. The AUCs of the 14-TF signature at 1, 3 and 5 years in the internal validation dataset were 0.827, 0.817 and 0.811, respectively (Figure 5B). High predictive power was also observed in the external validation dataset (0.808, 0.907 and 0.813) (Figure 5D) and in the entire dataset (0.815, 0.849 and 0.801) (Figure 5F). These results suggest that the 14-TF signature is a stable predictor of RFS in GC patients.
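An AUC at a fixed time horizon can be approximated as follows. This is a deliberately simplified sketch: patients who recurred by the horizon are cases, patients still recurrence-free beyond it are controls, and patients censored before the horizon are excluded. Published time-dependent ROC estimators handle censoring more carefully, and the source does not state which estimator or R package was used, so the function and data below are assumptions.

```python
def auc_at_horizon(times, events, risks, horizon):
    """Probability that a case (recurrence by `horizon`) has a higher risk score
    than a control (recurrence-free beyond `horizon`); ties count 0.5."""
    cases = [r for t, e, r in zip(times, events, risks) if e and t <= horizon]
    controls = [r for t, r in zip(times, risks) if t > horizon]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for case_risk in cases:
        for control_risk in controls:
            if case_risk > control_risk:
                wins += 1.0
            elif case_risk == control_risk:
                wins += 0.5
    return wins / (len(cases) * len(controls))


# Toy data: times in months, 1 = recurrence, 0 = censored; not study data.
times = [6, 10, 20, 40, 50, 8]
events = [1, 1, 1, 0, 0, 0]
risks = [0.9, 0.8, 0.3, 0.2, 0.4, 0.5]
auc_1y = auc_at_horizon(times, events, risks, horizon=12)
auc_3y = auc_at_horizon(times, events, risks, horizon=36)
```

The patient censored at 8 months contributes to neither group at either horizon, which is the main limitation of this simplified version.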
Furthermore, patients were ranked by risk score (Figure 6A), and a dot plot was drawn according to their survival status (Figure 6B). The results implied that the high-risk cohort had a higher mortality rate than the low-risk cohort. A heatmap of the 14 TFs grouped by risk score is presented in Figure 6C and was consistent with the earlier boxplots. Similar results were obtained in subgroups defined by age, gender, stage, histologic type, anatomic site and metastasis status, demonstrating good predictive power of the 14-TF signature in the majority of subgroups (Figure S3-S7).

Determination of the 14-TF signature-related biological pathways
Patients were assigned to a high- or low-risk cohort using the median risk score as the cutoff. The top 20 pathways that were more activated in high-risk patients than in low-risk patients are shown in Figure 7A and Table S4. The enriched pathways showed the same trend as the risk score (Figure 7B), suggesting a good correlation between the pathways and the risk score.

Nomogram development
We performed univariate and multivariate Cox analyses using the TF-related risk score and a few other  (Figure 10C, 10D and 10E). In addition, DCA indicated that the nomogram provided greater clinical value for predicting RFS in GC patients than the treat-all or treat-none strategies. The benefit was most evident for the 3-year recurrence risk of GC patients (Figure 10F), suggesting strong robustness of our model.
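The quantity plotted in a decision curve is the net benefit at a given threshold probability. The sketch below uses the standard binary-outcome formula, net benefit = TP/n - FP/n x pt/(1 - pt), and compares a model with the treat-all strategy (treat-none has a net benefit of 0 by definition). It ignores censoring, which the survival setting of the study would require handling; the function names and toy data are assumptions.

```python
def net_benefit(predicted_risk, outcomes, threshold):
    """Net benefit of treating patients whose predicted risk is >= threshold."""
    n = len(outcomes)
    pairs = list(zip(predicted_risk, outcomes))
    true_pos = sum(1 for p, y in pairs if p >= threshold and y == 1)
    false_pos = sum(1 for p, y in pairs if p >= threshold and y == 0)
    return true_pos / n - false_pos / n * threshold / (1.0 - threshold)


def net_benefit_treat_all(outcomes, threshold):
    """Net benefit when every patient is treated regardless of predicted risk."""
    prevalence = sum(outcomes) / len(outcomes)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


# Toy predicted 3-year recurrence risks and observed outcomes; not study data.
predicted = [0.9, 0.8, 0.2, 0.1]
observed = [1, 1, 0, 0]
nb_model = net_benefit(predicted, observed, threshold=0.5)
nb_all = net_benefit_treat_all(observed, threshold=0.5)
```

A model is clinically useful at a threshold when its net benefit exceeds both the treat-all value and zero, which is the comparison the decision curve displays across a range of thresholds.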

Discussion
GC remains a severe public health challenge worldwide. Currently, prognostic models for GC are mainly based on the UICC Tumor-Node-Metastasis (TNM) staging system. However, outcomes differ greatly among patients with a similar TNM stage because of the inherent heterogeneity of the disease [3][4][5][6]. Novel prognostic predictors and more valuable prognostic models are therefore urgently required.
In this study, we identified a combination of 14 TFs (NOTCH3, NR5A1, WDR5, RARB, SRCAP, SMAD3,   ONECUT1, PITX3, TRAF6, MTA2, JDP2,  
This study has the following limitations. First, more independent external validation sets are needed to evaluate the power of the 14-TF signature. Second, we constructed the nomogram solely on the TCGA dataset because complete clinical information was lacking for the GSE26253 dataset.
Nevertheless, our study has several strengths. The LASSO method was used to filter variables between the univariate and multivariate Cox analyses, reducing the interference of multicollinearity. In addition, no studies have yet combined a TF signature with clinical factors to predict RFS in GC. We combined TF bioinformatic analysis with clinical factors, which may help guide translational studies and the application of molecularly targeted therapy. We built a nomogram that combined the 14-TF signature with conventional clinicopathological factors to predict 1-, 3- and 5-year RFS. The results implied that the 14-TF signature has good ability to predict the RFS of GC patients in clinical routine, which adds to the significance of our findings. Furthermore, DCA was employed to assess the value of our nomogram. Various studies have used DCA to assess predictive models in clinical research. For instance, Ishioka et al.

Conclusion
We identified a 14-TF signature for predicting the RFS of GC patients by integrated bioinformatic analysis. Kaplan-Meier and ROC analyses confirmed that the 14-TF signature was an effective prognostic predictor of RFS in GC patients. In addition, we built a predictive nomogram that integrated the 14-TF signature with conventional clinicopathological factors, and the results indicated that the nomogram had good predictive capacity.

Flow chart of the present study.
Boxplots of the expression values of the 14 transcription factors by risk group in the TCGA dataset. "High" and "Low" represent the high-risk and low-risk groups, respectively. Differences between the two groups were assessed by the Mann-Whitney U test, and P values are provided in the graphs.
TFs risk score analysis of 384 GC patients in the TCGA dataset. (A) TFs risk score distribution
TFs-associated nomogram for the prediction of RFS in GC. The nomogram was developed in the entire TCGA cohort, with the TFs risk score, sex, cancer status and tumor grade.