15-lncRNAs-based classifier-clinicopathologic nomogram improves the prediction of recurrence in patients with hepatocellular carcinoma

Background: Our study aims to develop an lncRNAs-based classifier and a nomogram incorporating the genomic signature and clinicopathologic factors to improve the accuracy of recurrence prediction for hepatocellular carcinoma (HCC) patients.
Methods: The lncRNAs profiling data of 374 HCC patients and 50 normal healthy controls were downloaded from The Cancer Genome Atlas (TCGA). Using univariable Cox regression and least absolute shrinkage and selection operator (LASSO) analysis, we developed a 15-lncRNAs-based classifier and compared it with an existing six-lncRNAs signature. In addition, a nomogram incorporating the genomic classifier and clinicopathologic factors was developed. The predictive accuracy and discriminative ability of the genomic-clinicopathologic nomogram were determined by the concordance index (C-index) and calibration curves and were compared with the TNM staging system by C-index and receiver operating characteristic (ROC) analysis. Decision curve analysis (DCA) was performed to estimate the clinical value of our nomogram.
Results: Fifteen relapse-free survival (RFS)-related lncRNAs were identified, and the classifier consisting of these 15 lncRNAs could effectively classify patients into high-risk and low-risk subgroups. The prediction accuracy of the 15-lncRNAs-based classifier for 2-year and 5-year RFS was 0.791 and 0.834 in the training set and 0.684 and 0.747 in the validation set, which was better than the existing six-lncRNAs signature. Moreover, the AUC of the genomic-clinicopathologic nomogram in predicting RFS was 0.837 in the training set and 0.753 in the validation set, and its C-index was 0.78 (0.72-0.83) in the training set and 0.71 (0.65-0.76) in the validation set, which was better than the traditional TNM stage and the 15-lncRNAs-based classifier. Decision curve analysis further demonstrated that our nomogram had larger net benefit

year, 841,000 patients develop HCC and 782,000 HCC patients die [1]. To date, only two treatments, surgical resection and liver transplantation, are recommended as first-line therapy to potentially cure HCC [2]. However, surgical resection is hampered because more than 70% of HCC patients experience disease recurrence by approximately 5 years after resection [3], and donor organ shortages limit the large-scale application of liver transplantation. Therefore, reliable and accurate predictive markers or models that identify which subset of patients with HCC is vulnerable to recurrence are urgently needed.
Long non-coding RNAs (lncRNAs) are newly discovered RNA transcripts which were found to play an Moreover, prognostic signatures based on lncRNAs have also been found to improve prognosis prediction in HCC [12; 13], but the predictive value of lncRNAs-based signatures for recurrence of HCC remains poorly evaluated.
In the present study, we aimed to develop an lncRNAs-based classifier and a nomogram incorporating the genomic signature and clinicopathologic factors to improve the accuracy of recurrence prediction for HCC patients after surgery. We identified lncRNAs that were significantly associated with relapse-free survival (RFS) of HCC patients from The Cancer Genome Atlas (TCGA) and then used them to construct an lncRNAs-based classifier in the training set. A nomogram incorporating the lncRNAs-based classifier and clinicopathologic factors was also developed for predicting RFS. Finally, the predictive ability of the nomogram was evaluated and validated in an internal validation set.

Ethics Statement
All data were obtained from TCGA, and informed consent had been obtained from the patients before our study.

Collection of lncRNAs data and clinical characteristics of HCC patients from TCGA
The lncRNAs profiling data of 374 HCC patients and 50 normal healthy controls were downloaded from TCGA. Clinical parameters, including age, gender, family history, alcohol consumption, mutation count, fraction genome altered, BMI, AFP, platelet, albumin, creatinine, cirrhosis, histologic grade, T stage, TNM stage, Eastern Cooperative Oncology Group (ECOG), and RFS time, were also downloaded from TCGA. Eighty-one HCC patients were excluded because their RFS time was <1 month or their lncRNAs data were unavailable. Thus, 293 HCC patients with available lncRNAs data and clinical characteristics were finally included in our study. These 293 patients were then randomly assigned to a training set (N=147) and a validation set (N=146) using R software.
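The random assignment described above can be illustrated with a short sketch. It is written in Python rather than the R code used in the study, and the patient identifiers and the seed are hypothetical placeholders, not values from the original analysis.

```python
import random

def split_cohort(patient_ids, n_train, seed=0):
    """Randomly assign patients to a training set and a validation set."""
    rng = random.Random(seed)      # fixed seed (assumed) makes the split reproducible
    shuffled = list(patient_ids)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

# Hypothetical identifiers standing in for the 293 included patients.
patients = [f"P{i:03d}" for i in range(1, 294)]
train, valid = split_cohort(patients, n_train=147)
print(len(train), len(valid))  # 147 146
```

Fixing the seed is a design choice that lets the same split be regenerated later; the seed actually used by the authors is not reported.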

Construction and validation of lncRNAs-based classifier for RFS
First, the moderated t-statistics method and the Benjamini-Hochberg procedure were used to identify differentially expressed lncRNAs between normal tissues and HCC tissues. The cut-off criteria for differential expression were P<0.05 and a false discovery rate (FDR) <0.05. Next, univariable Cox regression analysis was used to select RFS-related lncRNAs in the training set. Least absolute shrinkage and selection operator (LASSO) analysis was used to further narrow down the RFS-related lncRNAs, and a signature consisting of these selected lncRNAs was developed [14]. LASSO is a popular estimation procedure in multiple linear regression when the underlying design has a sparse structure, as it can set some regression coefficients exactly equal to 0. In our study, we developed a perturbation bootstrap method and established its validity in approximating the distribution of the LASSO in heteroscedastic linear regression. The underlying covariates were allowed to be either random or non-random, and the proposed bootstrap method was proved to work irrespective of the nature of the covariates. A simulation study also justified our method in finite samples. To obtain an accurate estimate of the stability of our model, cross-validation was performed, as it could provide unbiased estimation [15]. In the present study, 5-fold cross-validation was carried out: the 293 HCC patients were divided into 5 subsets of equal size and the model was trained 5 times, each time leaving out one subset as validation data. The accuracies were 90.3%, 91.1%, 90.8%, 90.4% and 91%, respectively, indicating the good stability of our 15-lncRNAs-based classifier. Using this classifier, we calculated the risk scores of HCC patients and then divided them into high-risk and low-risk subgroups based on the best cut-off value, defined as the point at which the Youden index (sensitivity + specificity - 1) reached its maximum value in the training set.
The RFS difference between high-risk and low-risk patients was further compared by Kaplan-Meier analysis. The log-rank test was used to compare subgroups. The flowchart of the present study is shown in Figure 1.
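The cut-off selection by Youden index can be sketched as follows. This is a simplified illustration that treats recurrence as a binary label and ignores censoring and follow-up time, which the time-dependent ROC analysis in the study does account for. The risk scores and outcomes are made-up numbers, and the function name is hypothetical.

```python
def youden_cutoff(scores, events):
    """Return (cutoff, J) maximising J = sensitivity + specificity - 1.

    A patient is called high-risk when score >= cutoff.
    """
    n_pos = sum(events)
    n_neg = len(events) - n_pos
    best_cut, best_j = None, -1.0
    for cut in sorted(set(scores)):
        tp = sum(1 for s, e in zip(scores, events) if s >= cut and e == 1)
        tn = sum(1 for s, e in zip(scores, events) if s < cut and e == 0)
        j = tp / n_pos + tn / n_neg - 1
        if j > best_j:                      # keep the first cut-off reaching the maximum
            best_cut, best_j = cut, j
    return best_cut, best_j

# Made-up risk scores and recurrence labels (1 = recurrence) for eight patients.
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
events = [0, 0, 1, 0, 0, 1, 1, 1]
cutoff, j = youden_cutoff(scores, events)
print(cutoff, j)  # 0.6 0.75
```

Patients with a score at or above the returned cut-off would form the high-risk subgroup and the rest the low-risk subgroup.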

Receiver operating characteristic (ROC)
To further evaluate the predictive accuracy of the lncRNAs-based classifier, ROC analysis was performed in the training set and the validation set. We calculated the area under the ROC curve (AUC) of the lncRNAs-based classifier for predicting 2-year and 5-year RFS and compared the predictive ability of our lncRNAs-based signature with another published lncRNAs signature for RFS [16], which was also developed from HCC patients of TCGA.

Genomic-clinicopathologic nomogram
To make the lncRNAs-based classifier more applicable for clinicians, a nomogram based on the classifier was constructed. First, univariate and multivariate Cox regression analyses were used to identify clinical risk parameters associated with RFS in the training set. Next, the lncRNAs-based classifier, together with the risk parameters, was used to develop a genomic-clinicopathologic nomogram in the training set.
Model performance was evaluated by determining calibration and discrimination. Discrimination is the model's ability to differentiate between patients who experience HCC recurrence and patients who do not.
Discrimination was calculated through the concordance index (C-index). We also illustrated discrimination by dividing the dataset into three groups based on the score generated by the nomogram. We plotted a Kaplan-Meier curve for all three groups.
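As an illustration of what the C-index measures, the sketch below computes Harrell's C-index by direct pair counting. It is a minimal Python version under simplifying assumptions (pairs with tied times are skipped), not the "rms" routine used in the study, and the follow-up times, event indicators and risk scores are invented.

```python
def harrell_c_index(times, events, risk_scores):
    """Share of usable pairs in which the earlier relapse has the higher risk score."""
    concordant = 0.0
    usable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is usable when patient i relapsed strictly before
            # patient j's relapse or censoring time.
            if events[i] == 1 and times[i] < times[j]:
                usable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5   # tied scores count as half
    return concordant / usable

# Invented data: months of follow-up, relapse indicator, nomogram-style risk score.
times = [5, 10, 15, 20]
events = [1, 1, 0, 1]
risk = [0.9, 0.4, 0.6, 0.1]
c = harrell_c_index(times, events, risk)
print(c)  # 0.8
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering of patients by risk.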
Calibration of the nomogram was assessed by plotting the observed RFS rate (the mean Kaplan-Meier estimate for patients in each octile) against the nomogram-predicted 2- and 5-year RFS probability (i.e., the mean nomogram-predicted probability for patients in each octile). A perfectly accurate nomogram would result in a plot in which the observed and predicted probabilities for given groups fall along the 45-degree line. The distance between each pair and the 45-degree line was a measure of the absolute error of the nomogram's prediction.
ROC analysis was used to evaluate and compare the discrimination ability of the nomogram with the lncRNAs-based classifier and TNM stage. Then, decision curve analysis (DCA) was performed to evaluate the clinical usefulness of the genomic-clinicopathologic nomogram [17; 18]. DCA was performed by calculating the net benefit for a range of threshold probabilities, which places benefits and harms on the same scale. This analysis determined whether clinical decision-making based on a model would do more good than harm. DCA provided straightforward information about the clinical value of a model, in contrast to traditional measures such as sensitivity or specificity, which are abstract statistical concepts.
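The net benefit underlying a decision curve can be written as true positives per patient minus false positives per patient weighted by the odds of the threshold probability. The sketch below shows this for a binary outcome at a single threshold. It is a simplified stand-in for the survival-data version implemented in "stdca.R", and the predicted probabilities and outcomes are invented.

```python
def net_benefit(pred_probs, outcomes, threshold):
    """Net benefit of treating patients whose predicted risk is >= threshold."""
    n = len(outcomes)
    tp = sum(1 for p, y in zip(pred_probs, outcomes) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(pred_probs, outcomes) if p >= threshold and y == 0)
    return tp / n - fp / n * threshold / (1 - threshold)

def net_benefit_treat_all(outcomes, threshold):
    """Net benefit of the reference strategy of treating every patient."""
    prevalence = sum(outcomes) / len(outcomes)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)

# Invented predicted recurrence probabilities and observed recurrences (1 = yes).
probs = [0.9, 0.8, 0.6, 0.3, 0.2, 0.1, 0.7, 0.4]
observed = [1, 1, 0, 0, 0, 0, 1, 1]
nb_model = net_benefit(probs, observed, threshold=0.5)
nb_all = net_benefit_treat_all(observed, threshold=0.5)
print(nb_model, nb_all)  # 0.25 0.0
```

Repeating the calculation over a range of thresholds and plotting the results against the treat-all and treat-none (net benefit 0) strategies gives the decision curve.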

Biological function and pathways of 15-lncRNAs-based classifier
To explore the biological function and pathways of the 15-lncRNAs-based classifier, GO and KEGG analyses were conducted. First, Pearson correlation was calculated between these 15 lncRNAs and the protein-coding genes (mRNAs), and a correlation coefficient >0.4 with P<0.001 was considered a significant correlation. Then, the potential biological processes of the target genes of these lncRNAs were further investigated by Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) analysis in DAVID, a common bioinformatics tool (http://david.abcc.ncifcrf.gov/, version 6.8) [19].
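The co-expression screen can be sketched as below. This minimal Python example applies only the correlation coefficient threshold (>0.4, as stated) and omits the P<0.001 test, which would normally come from a statistics library. The expression values and gene names are invented placeholders.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def coexpressed_genes(lnc_expr, mrna_expr, r_min=0.4):
    """Return {gene: r} for mRNAs whose correlation with the lncRNA exceeds r_min."""
    hits = {}
    for gene, values in mrna_expr.items():
        r = pearson_r(lnc_expr, values)
        if r > r_min:
            hits[gene] = r
    return hits

# Invented expression values across five samples.
lnc = [1, 2, 3, 4, 5]
mrna = {
    "GENE_A": [2, 4, 6, 8, 10],   # perfectly correlated
    "GENE_B": [5, 3, 4, 1, 2],    # negatively correlated, filtered out
    "GENE_C": [1, 3, 2, 5, 4],    # r = 0.8
}
hits = coexpressed_genes(lnc, mrna)
print(sorted(hits))  # ['GENE_A', 'GENE_C']
```

The genes retained for each lncRNA would then be submitted to the GO and KEGG enrichment step.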

Statistical analysis
SPSS statistics 22.0 and R software (R version 3.5.2) were used to conduct the statistical analysis.
Univariate and multivariate Cox regression analyses were performed to identify potential predictors associated with RFS. Missing data in any of the potential predictors were imputed, because analyzing the full set of cases improves statistical power and reduces potential bias. Multiple imputation was used because the missing data were considered missing at random after their patterns were analyzed. Multiple imputation was performed with the Markov chain Monte Carlo method, and 5 iterations were used to account for possible simulation errors.
LASSO analysis was performed with the "glmnet" package, cross-validation was conducted with the "caret" package, and ROC analysis was done with the "survivalROC" package. The nomogram and calibration plots were generated with the "rms" package, and DCA was performed with "stdca.R". A two-sided P<0.05 was considered statistically significant.

Demographic parameters and RFS outcome of HCC patients
In the present study, 293 HCC patients with available lncRNAs data and clinical characteristics were included. The basic clinical characteristics of these HCC patients are summarized in Table 1. The median RFS was 20.99 months (range: 1.22-120.73 months). Of the 293 HCC patients, 170 (57.9%) developed recurrence during follow-up, and the 2-year and 5-year RFS rates were 46.4% and 29.1%, respectively. It is also of note that among the recurrent patients, 143 (84.1%) experienced recurrence during the first two years after resection.

Development and validation of lncRNAs-based classifier
First, 1292 distinct differential lncRNAs between normal tissues and HCC tissues were obtained based on  Figure S2A, Figure S2C, Figure S1C ). Taken together, these results suggested that the 15-lncRNAs-based classifier could effectively classify HCC patients into two distinct subgroups with high risk or low risk of recurrence or OS.

Predictive value of 15-lncRNAs-based classifier and comparison with other lncRNAs-based classifier from TCGA
Recently, a six-lncRNAs-based signature was developed and validated by Gu et al. to predict RFS with the data of HCC patients from TCGA [16]. To compare the predictive value of our 15-lncRNAs-based classifier with the six-lncRNAs-based signature, ROC curve analysis was performed. As was shown at

Development of genomic-clinicopathologic nomogram
To make the lncRNAs-based classifier more applicable for clinicians, a 15-lncRNAs-based classifier-clinicopathologic nomogram was developed to predict the 2-year and 5-year RFS in HCC patients.
Potential predictors associated with RFS were identified by univariate and multivariate Cox regression analysis in the training set. Univariable Cox regression analysis showed that mutation count, BMI, AFP, liver cirrhosis, tumor stage, TNM stage, ECOG and the 15-lncRNAs-based classifier were related to RFS, and multivariate Cox regression analysis further showed that mutation count, AFP, T stage, ECOG and the 15-lncRNAs-based classifier were independent predictors of RFS (Table 3). These five predictors were therefore used to develop the genomic-clinicopathologic nomogram (Figure 5), which may help clinicians to preoperatively predict the recurrence risk in HCC patients. The C-index of the genomic-clinicopathologic nomogram was 0.78 (0.72-0.83) (Table 4), and the calibration plots exhibited good consistency between the predicted RFS and the actual RFS (Figure 6A, 6C). Consistent results were also found in the validation set. The C-index of the genomic-clinicopathologic nomogram in the validation set was 0.71 (0.65-0.76) (Table 4), and the calibration plots also showed good consistency between the predicted RFS and the actual RFS (Figure 6B, 6D). Additionally, the tertiles of the total points were used to divide the patients into high-, intermediate- and low-risk groups with distinct RFS time or OS time. The Kaplan-Meier analysis (log-rank P<0.0001) of the three risk subgroups indicated the utility of the genomic-clinicopathologic nomogram in the training set (Figure S3A, Figure S4A), in the validation set (Figure S3B, Figure S4B), and in the total cohort (Figure S5A, S5B). All these results indicated the good performance of our genomic-clinicopathologic nomogram.
To further evaluate the predictive ability of the genomic-clinicopathologic nomogram, we compared its C-index and ROC analysis results with those of the AJCC TNM stage and the 15-lncRNAs-based classifier in the training set and the validation set. As was shown at Table 4 Finally, the clinical usefulness of the genomic-clinicopathologic nomogram was assessed by DCA, which gave visualized information on the clinical value of a model. As presented in Figure 8A and Figure 8B, DCA results showed that HCC recurrence-associated treatment decisions based on the genomic-clinicopathologic nomogram yielded more net benefit than treatment decisions based on TNM stage or the 15-lncRNAs-based classifier, or treating either all patients or none, in both the training set and the validation set.

Biological function and pathways of 15-lncRNAs-based classifier
To explore the biological function and pathways of the 15-lncRNAs-based classifier, GO and KEGG analyses were performed. As shown in Figure  could also divide patients into high-risk and low-risk groups with significantly different OS [13]. Wu et al. found that MIR22HG, CTC-297N7.9, CTD-2139B15.2, RP11-589N15.2, RP11-343N15.5, and RP11-479G22.8 were independent predictors of HCC patients' OS; a signature consisting of these six lncRNAs could effectively classify patients into high-risk patients with shorter survival time and low-risk patients with longer survival, and the six-lncRNA signature exhibited superior predictive capacity to TNM stage [12]. These studies suggested the potential clinical implications of lncRNA-based signatures in improving the prognosis prediction of HCC. However, it should be noted that OS is more likely to be influenced by post-recurrence treatment and liver function, whereas RFS could more accurately reflect the biologic behavior of HCC. Thus, in the present study, we tried to identify RFS-related lncRNAs and developed and validated a classifier, which may be more valuable for the management of HCC patients.
Recently, Gu et al. developed and validated a six-lncRNAs-based signature to predict RFS, also with the data of HCC patients from TCGA [16]. In addition to the six-lncRNAs-based signature, Gu et al. also developed another lncRNAs-based signature for HCC recurrence in patients with small HCC (maximum tumor diameter ≤5 cm) [21]. In that study, a 3-lncRNAs-based signature consisting of LOC101927051, LINC00667 and NSUN5P2 was developed and validated for predicting OS and RFS in patients with small HCC. This 3-lncRNAs-based signature was more suitable for patients with higher serum levels of AFP (>20 ng/ml) and relatively lower levels of albumin (<4.0 g/dl), or Asian patients with no family history of HCC or history of alcohol consumption. Consistent with previous studies, mutation count, AFP, T stage and ECOG were found to be significantly associated with HCC recurrence in the present study [21; 22; 23; 24]. In addition to these clinical factors, as expected, the 15-lncRNAs-based signature was also found to be significantly related to HCC recurrence in our study. To date, only a few validated clinical nomograms for HCC recurrence have been reported [25; 26; 27]. For example, a nomogram consisting of 7 clinical factors, including age, AFP, PT, magnitude of hepatectomy, postoperative complication, number of tumor nodules and microvascular invasion, was developed and validated using data of 617 HCC patients. However, the results limited its use for HCC patients beyond the Milan criteria [26]. Another nomogram incorporating sex, log of calculated tumor volume, ALB, platelet count, and microvascular invasion was constructed and validated by Shim et al. with data from 1085 HCC patients. However, this model could only be applied to early-stage HCC patients, and the authors only evaluated the prediction accuracy for 2-year RFS [27].
Unlike the nomograms described above, we developed and validated a nomogram incorporating clinicopathologic parameters and genomic data, which may help to improve the stability and accuracy of the nomogram's predicted probabilities [28].
Among the 15 RFS-related lncRNAs, AP002478.1, GACAT3 and LINC00462 have been previously reported to be related to cancer. AP002478.1 has been reported to be a potential prognostic biomarker for HCC patients and gastric cancer patients [29; 30]. GACAT3 was first found to be significantly overexpressed in gastric cancer tissues and the gastric cancer cell line MGC-803. Higher expression of GACAT3 was significantly associated with tumor size, distant metastasis, TNM stage and shorter OS [31; 32]. One mechanistic study showed that GACAT3 knockdown could significantly inhibit proliferation, colony formation, migration, and invasion of GC cells by regulating miR-497, while down-regulation of GACAT3 decreased its tumorigenesis [33]. Moreover, GACAT3 was also found to have a similar tumorigenic role in breast cancer, glioma and colorectal cancer [34; 35; 36]. Up-regulation of LINC00462 was found to be associated with larger tumor size, poorer tumor differentiation, TNM stage and metastasis in pancreatic cancer patients. A mechanistic study indicated that LINC00462 could promote proliferation, migration and invasion by regulating miR-665 [37]. Notably, LINC00462 was also significantly overexpressed in HCC tissues, and knockdown of LINC00462 inhibited the aggressive oncogenic phenotype in HCC cells by regulating the PI3K/AKT signaling pathway, suggesting that LINC00462 may be a promising therapeutic target for HCC [38]. Therefore, further research on the biological function of these identified lncRNAs may shed light on HCC recurrence.
Although our genomic-clinicopathologic nomogram demonstrated good performance in HCC recurrence prediction, the limitations of this study should also be noted. First, our nomogram is limited by the retrospective collection of data and fails to include some already recognized RFS-related factors (e.g., liver cirrhosis, vascular invasion) and some important molecular factors (e.g., TP53 mutation).
Further efforts to incorporate more geographic and molecular factors will potentially help to improve the performance of the present model. Second, there is no external or prospective validation of the genomic-clinicopathologic nomogram in the present study, so external and multicenter prospective cohorts with large sample sizes are still needed to validate the clinical application of our model. Finally, we do not explore the underlying biological function and pathways of the genomic classifier, so further mechanistic studies are needed to uncover the related mechanisms.
In conclusion, we developed an lncRNAs-clinicopathologic nomogram and demonstrated that it appears to be a more effective tool for HCC recurrence prediction than TNM stage and other lncRNA-based signatures from TCGA. The lncRNAs-clinicopathologic nomogram may help clinicians to devise better individualized therapeutic strategies for HCC patients.
*A higher homogeneity likelihood ratio indicates a smaller difference within the staging system, meaning better homogeneity.
**A higher discriminatory ability linear trend indicates a higher linear trend across the staging system, meaning better discriminatory ability and gradient monotonicity.
***A higher C-index means better discriminatory ability.
****Smaller AIC values indicate better optimistic prognostic stratification.
Figures
Figure 1 The flowchart of study design. LASSO: least absolute shrinkage and selection operator.
