Predictive performance and clinical utility of HCC risk scores in chronic hepatitis C: a comparative study

Many HCC risk prediction scores have been developed to guide HCC risk stratification and to identify CHC patients who either need intensified surveillance or may not require screening. There is a need to compare these scores and their predictive performance in clinical practice. We aimed to compare the newest HCC risk scores, evaluating their discriminative ability and clinical utility in a large cohort of CHC patients. The performance of the scores was evaluated in 3075 CHC patients who achieved SVR following DAAs using the Log rank test and Harrell’s C statistic; the scores were also tested for HCC-risk stratification and negative predictive values. HCC developed in 212 patients within 5 years of follow-up. Twelve HCC risk scores were identified, and all displayed a significant Log rank test (p ≤ 0.05) except the Alonso-Lopez TE-HCC and Chun scores (p = 0.374 and p = 0.053, respectively). Analysis of the remaining ten scores revealed that the ADRES, GES (pre- and post-treatment), GES algorithm and Watanabe (post-treatment) scores, which include the dynamics of AFP, were clinically applicable and demonstrated good statistical performance: Log rank p < 0.001, Harrell’s C statistic 0.66–0.83 and high negative predictive values (94.38–97.65%). With these scores, the 5-year cumulative IR in the low-risk groups was very low (0.54–1.6), so screening could be safely avoided in these patients. The ADRES, GES (pre- and post-treatment), GES algorithm and Watanabe (post-treatment) scores seem to offer acceptable HCC-risk predictability and clinical utility in CHC patients. The dynamics of AFP as a component of these scores may explain their high performance when compared to other scores.


Introduction
Chronic hepatitis C infection is a major public health problem, with an estimated 71 million persons chronically infected with hepatitis C worldwide [1]. Hepatitis C virus (HCV) is the most common cause of hepatocellular carcinoma (HCC), with an annual HCC incidence of approximately 3-8% in cirrhotic patients [2]. The use of highly effective and safe direct acting antivirals (DAAs) has revolutionized the management of chronic HCV patients, particularly those with liver cirrhosis and advanced hepatic fibrosis. The majority of patients with HCV infection are expected to be treated over the next years. Several studies found that viral clearance after DAAs lowered but did not completely eliminate the occurrence of HCC in post-SVR patients [3][4][5][6].
Current guidelines recommend biannual HCC surveillance by ultrasound with or without alpha-fetoprotein (AFP) in patients with cirrhosis [7,8]. According to cost-effectiveness analyses, an annual incidence of 1.5% or higher would warrant systematic surveillance of HCC [9]. These recommendations are backed up by data indicating improved survival and a higher rate of early tumor diagnosis and curative treatments among patients undergoing screening for HCC [10]. However, a 'one-size-fits-all' strategy increases health care costs, particularly in low- to middle-income countries with a high HCV prevalence. Furthermore, it is estimated that only a small percentage of patients with cirrhosis are monitored according to guidelines, highlighting the urgent unmet clinical need for a better prediction model to guide HCC surveillance among patients with advanced liver fibrosis who have achieved SVR [11].
In this context, many HCC risk prediction scores (Table 1) have recently been developed to guide HCC risk stratification and to identify CHC patients who either need intensified surveillance or may not require screening. In the literature, there are no data comparing these different scores; consequently, there is a need for direct comparison of the performance, applicability and clinical utility of these HCC risk scores in independent patient populations. This comparison is highly needed to help health authorities focus resources on patients at high risk of developing HCC and avoid screening patients with very low HCC risk. The assessment of these scores will also provide the data necessary to support the needed modification of the current guidelines for HCC screening.
Our aim is to evaluate the newest HCC risk scores comparing their discriminative ability, applicability and clinical utility in a large cohort of CHC patients who achieved SVR following DAAs.

Patients and methods
Cohort
A total of 3075 consecutive CHC patients with liver cirrhosis (F4) or advanced liver fibrosis (F3) who had a sustained virologic response (SVR) after receiving DAAs were included in this observational study.
Between January 2014 and July 2019, patients were recruited from the out-patient clinics at the Egyptian Liver Research Institute and Hospital (ELRIAH) and its satellites throughout the Nile Delta.

Patients' evaluation
All patients in the cohort had their initial data (before treatment) recorded, together with the data from the follow-up visits, up to the last follow-up. HCC incidence, together with expected HCC incidence data, was also recorded. For score comparison, we relied on pre-treatment data (immediately before the onset of DAAs) and post-treatment data (24 weeks after the end of treatment).
The diagnosis of HCC was made in accordance with EASL and AASLD guidelines. Multiphase CT or MRI was performed if abdominal ultrasound detected any focal hepatic lesion and/or the AFP value was > 20 ng/mL. The diagnosis of HCC was based on characteristic arterial enhancement and early washout in the delayed phase [2,14,15].

Scores selection
We undertook a systematic literature search of the PubMed database for studies reporting HCC prediction scores over the last 5 years. We used the following search terms: ("Hepatocellular carcinoma" [Mesh] AND "HCV" AND "risk score" [Mesh]). Citations generated by electronic scanning were assessed for relevance based on title, abstract and key words. Prediction risk scores whose parameters were available in our cohort were included.

Statistical methods
Statistical analyses were performed using SPSS (Statistical Package for Social Sciences) version 26 (IBM Corp., USA). The follow-up duration was calculated as the time between the end of treatment and either the last follow-up or the date of the event (HCC occurrence), whichever occurred first. Times to events and cumulative incidences were calculated with the Kaplan-Meier method.
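As an illustration of the follow-up and Kaplan-Meier definitions above, the following minimal Python sketch estimates a cumulative incidence as 1 - S(t). It is not the SPSS procedure used in this study; the function names and the toy follow-up data are hypothetical and serve only to show the calculation.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.

    times  -- follow-up duration per patient (e.g. months from end of treatment)
    events -- 1 if HCC occurred at that time, 0 if censored at last follow-up
    """
    hcc_at = Counter(t for t, e in zip(times, events) if e)
    leaving_at = Counter(times)  # every patient leaves the risk set at own time
    at_risk = len(times)
    surv = 1.0
    curve = []
    for t in sorted(leaving_at):
        d = hcc_at.get(t, 0)
        if d:
            # Product-limit step: multiply by the fraction surviving time t.
            surv *= 1.0 - d / at_risk
            curve.append((t, surv))
        at_risk -= leaving_at[t]
    return curve


def cumulative_incidence(curve, horizon):
    """Estimated probability of HCC by `horizon`, i.e. 1 - S(horizon)."""
    surv = 1.0
    for t, s in curve:
        if t > horizon:
            break
        surv = s
    return 1.0 - surv


# Hypothetical toy data: months of follow-up and HCC indicator.
toy_times = [6, 12, 12, 24, 30, 36, 48, 60, 60, 60]
toy_events = [1, 0, 1, 0, 1, 0, 0, 0, 0, 0]
toy_curve = kaplan_meier(toy_times, toy_events)
print(round(cumulative_incidence(toy_curve, 60), 3))
```

Censored patients contribute to the number at risk until their last follow-up but never trigger a step in the curve, which is why the estimate differs from the crude proportion of patients with HCC.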
The performance of the scores was evaluated using:
• Receiver operating characteristic (ROC) curve analysis for numeric scores, to assess the HCC predictive ability of the score. Accuracy is measured by the area under the ROC curve (AUROC). An area of 1 represents a perfect test; an area of 0.5 represents a worthless test. As a rough guide for classifying the accuracy of a diagnostic test, 0.7 indicates fair, 0.8 good and 0.9 excellent accuracy. Acceptable discrimination is indicated when AUROC is > 0.70 [16].
• Evaluation of the performance of the risk stratification as a screening procedure against HCC development as the gold standard. Using the risk stratification results, patients are classified into a risky group (intermediate and high risk score) and a less-risky group (low risk score), and then performance statistics (sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV) and accuracy) are calculated. NPV (the probability that subjects with a negative screening test truly do not have the disease) is of special importance in score comparisons [17].
• Log rank (Mantel-Cox) analysis for comparison of incidence curves. A p value ≤ 0.05 was considered significant.
• Harrell's C statistic [18]. The C-statistic gives only a general idea about a model (a goodness of fit measure), especially its discrimination ability. A value below 0.5 indicates a very poor model. A value of 0.5 means that the model is no better at predicting an outcome than random chance. Values over 0.7 indicate a good model. Values over 0.8 indicate a strong model. A value of 1 means that the model perfectly predicts those group members who will experience a certain outcome and those who will not [19,20].
• Overall performance by Brier score. The lower the Brier score is for a set of predictions, the better the predictions are calibrated [21].
• Calibration using the Hosmer-Lemeshow test. The output is a chi-square value (the Hosmer-Lemeshow chi-squared) and a p-value. A small p-value (usually under 5%) means that the model is a poor fit [19].
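The screening statistics, Harrell's C and the Brier score listed above can be sketched in a few lines of Python. This is an illustrative sketch only, not the software used in the study: the function names are hypothetical, tied follow-up times are simply ignored in the concordance calculation, and the Brier score shown is the plain mean squared error, which does not account for censoring.

```python
def screening_stats(flagged, developed_hcc):
    """2x2 screening statistics.

    flagged       -- 1 if the score placed the patient in the risky group
    developed_hcc -- 1 if the patient developed HCC (gold standard)
    Assumes every margin of the 2x2 table is non-empty.
    """
    tp = sum(1 for f, h in zip(flagged, developed_hcc) if f and h)
    fp = sum(1 for f, h in zip(flagged, developed_hcc) if f and not h)
    fn = sum(1 for f, h in zip(flagged, developed_hcc) if not f and h)
    tn = sum(1 for f, h in zip(flagged, developed_hcc) if not f and not h)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),  # low-risk patients who truly stayed HCC-free
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }


def harrell_c(times, events, scores):
    """Concordance: share of comparable pairs ranked correctly by the score.

    A pair (i, j) is comparable when patient i had the event before
    patient j's follow-up ended. Pairs with tied times are skipped
    (a simplification made for this sketch).
    """
    concordant = tied = comparable = 0
    for i in range(len(times)):
        if not events[i]:
            continue
        for j in range(len(times)):
            if times[i] < times[j]:
                comparable += 1
                if scores[i] > scores[j]:
                    concordant += 1
                elif scores[i] == scores[j]:
                    tied += 1
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return (concordant + 0.5 * tied) / comparable


def brier_score(predicted_prob, outcome):
    """Mean squared difference between predicted risk and observed 0/1 outcome."""
    return sum((p - o) ** 2 for p, o in zip(predicted_prob, outcome)) / len(outcome)
```

With a binary outcome and no censoring, the concordance computed this way coincides with the AUROC, which is why the two measures are read on the same 0.5 to 1 scale.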

Comparison of the scores
The different scores were evaluated in the same cohort in a stepwise manner:
• Statistical performance of the scores using the above-mentioned methods.
• Scores that were statistically valid were tested for their applicability and clinical utility by: (a) studying their ability to stratify patients into the different risk groups, either two groups only (high and low), three groups (high, intermediate and low) or four groups where the low group is subdivided into low and very low; (b) calculating the percentage of patients in each category together with its 5-year cumulative incidence of HCC; and (c) reporting the AUROC and NPV of the score.
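Steps (a) and (b) above can be illustrated with a short Python sketch. The cut-offs, labels and toy data are hypothetical and do not correspond to any of the published scores; for brevity, the crude proportion of patients who developed HCC stands in for the Kaplan-Meier 5-year cumulative incidence.

```python
from bisect import bisect_right


def stratify(score, cutoffs, labels):
    """Map a numeric score to a risk label.

    cutoffs must be sorted ascending and labels must have
    len(cutoffs) + 1 entries; a score equal to a cut-off falls
    into the higher group.
    """
    return labels[bisect_right(cutoffs, score)]


def summarize(scores, events, cutoffs, labels):
    """Per risk group: share of the cohort and crude HCC proportion."""
    counts = {label: [0, 0] for label in labels}  # [patients, HCC cases]
    for score, event in zip(scores, events):
        group = counts[stratify(score, cutoffs, labels)]
        group[0] += 1
        group[1] += event
    total = len(scores)
    return {
        label: {
            "percent_of_cohort": 100.0 * n / total,
            "hcc_proportion": hcc / n if n else None,
        }
        for label, (n, hcc) in counts.items()
    }


# Hypothetical three-group score with cut-offs at 3 and 6.
risk_labels = ["low", "intermediate", "high"]
toy_scores = [1, 2, 3, 5, 6, 8, 0, 7]
toy_events = [0, 0, 0, 1, 1, 1, 0, 0]
for label, row in summarize(toy_scores, toy_events, [3, 6], risk_labels).items():
    print(label, row)
```

A score is clinically useful in this sense when the low-risk row combines a large share of the cohort with a very low HCC proportion, which is the trade-off examined for each score in this study.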

Results
Using the previously mentioned search criteria, the search returned 95 papers.

Score exclusion criteria
• Prediction scores that depend on molecular and genetic risk factors, which are expensive, not done routinely and were not available for patients in our cohort, such as the fat-genetic risk score [22], TLL1 [23,24], IFNL3 [25,26], MICA [27] and DEPDC5 [28].
• Scores that need data not routinely available in our dataset, such as GGT (FIB4-HCC score by Alonso López et al. [29] and Ganne-Carrié et al. [30]).
• Scores that depend on complicated mathematical methods, like the El-Serag et al. (HES) score [31], or the methods used for stratification of patients into risk groups was  [32].
Accordingly, eleven scores and an algorithm were included in this study (Table 1).
All these scores were developed and/or validated using data from HCV patients, with two exceptions: the Sharma et al. THRI score [33] was developed and validated in multiple etiologies, with etiology included as a component of the score, whereas the Fan et al. aMAP score [37] was developed in HBV patients and validated in HBV, HCV and non-viral hepatitis patients. In all scores, patients were followed up for more than a year, up to 7.3 years for the Sharma et al. THRI score [33] (validation group).

Cohort characteristics
The study included 3075 chronic HCV patients (1037 patients with F3 and 2038 with F4 stage) with SVR who met the inclusion criteria between January 2014 and July 2019. Characteristics of the patients are shown in the Supplementary material.
The performance of eleven scores and an algorithm was compared, and the results are presented in Tables 2 and 3. Except for the TE-HCC score of Alonso López et al. [29] and the score of Chun et al. [42], all of the HCC risk scores investigated had adequate statistical performance, with a significant Log rank (Mantel-Cox) analysis for comparison of incidence curves (p value ≤ 0.05) (Fig. 2).

The performance of each score is listed in
As regards overall performance as measured by Brier score, some scores did not show good performance (Sharma et al. THRI score [33], Chun et al. HCC-SVR score [42] and Alonso López et al. TE-HCC score [29]).
As regards discrimination ability as measured by Harrell's C statistic, all results were in the range 0.5678-0.832; the lowest Harrell's C statistic was that of the Alonso López et al. TE-HCC score [29] and the best was that of the Shiha et al. GES algorithm [41].
As regards calibration using the Hosmer-Lemeshow test, a significant p-value (poor fit) was noted for the Sharma et al. THRI score [33] and the Hu et al. score [35].

Discussion
Direct comparison of the predictive performance and clinical utility of HCC risk scores in the same patient population has recently been reported in chronic hepatitis B [43]. To the best of our knowledge, this work is the first comparative study assessing the predictive performance of many HCC risk scores in the same large cohort of CHC patients who achieved SVR following DAAs, with a follow-up period of more than 5 years. Most of these scores showed acceptable performance for HCC prediction in our cohort and were able to stratify CHC patients into low-risk and risky groups.
All the studied HCC risk scores displayed reasonable statistical performance, with a significant Log rank (Mantel-Cox) analysis for comparison of incidence curves (p value ≤ 0.05), except the TE-HCC score of Alonso López et al. [29] and the score of Chun et al. [42]. The scores of Watanabe et al. (pre-treatment) [34], Sharma et al. (THRI score) [33] and Fan et al. (aMAP score) [37] revealed strong statistical performance, with a highly significant Log rank test (p < 0.001), Harrell's C statistic ≥ 0.64, area under the ROC curve (AUROC) values ≥ 0.73 and very high negative predictive values (> 98%). However, these scores stratified less than 25% of our cohort into the low-risk group, suggesting the need for HCC surveillance in the remaining 75-85% of patients (Fig. 2). Including the majority of patients in screening will not only diminish the cost-effectiveness of a surveillance program but may also impose substantial physical harms on patients, including multiple CT/MRI examinations. Recently, Fan et al. agreed that stratifying most of the patients into the high and intermediate risk groups based on the aMAP score would reduce the cost-effectiveness of surveillance [44]. This finding shows that good statistical performance of an HCC risk score may not translate directly into clinical usefulness. It is interesting that the same observation was emphasized in the comparison of HCC risk scores in CHB [43].
Conversely, the scores of Tani et al. [36] and Abe et al. [39] stratified most of the patients (70.8-84%) into the low-risk group, with 5-year cumulative IR (95% CI) of 2.13 (1.73-2.59) and 2.51 (2.00-3.10), respectively, which is higher than for all other studied scores. Stratifying such a large number of patients into the low-risk group means that many at-risk patients will not be screened and, consequently, many HCC cases will be missed, so HCC surveillance cannot be safely avoided.
The remaining scores were Watanabe et al. (post-treatment) [34], Hu et al. [35], GES score [40], GES algorithm [41] and ADRES [38]. The score of Hu et al. stratified the patients into only two categories, low- and high-risk groups; consequently, a large proportion (41%) of our patient cohort should undergo more intense screening, which may lead to reduced cost-effectiveness and increased physical harms. Screening of this large number of patients using HU
Finally, after evaluation and comparison of these eleven HCC risk scores, we ended with four scores, namely Watanabe et al. (post-treatment), GES score, GES algorithm and ADRES score. These scores were clinically applicable, being simple, easy to calculate and based on readily available clinical and laboratory parameters. In addition, these scores demonstrated good statistical performance: Log rank p < 0.001, Harrell's C statistic 0.66-0.83 and high negative predictive value (94.38-97.65%). These scores also successfully stratified our patients into low, intermediate and high risk groups, with a very low 5-year cumulative IR (0.54-1.6) in the low-risk group, which comprised about 50% of the cohort, so surveillance could be safely avoided in approximately half of the patients. On the other hand, the high-risk groups, with a high 5-year cumulative IR, comprised only about 20% of the patients, for whom more intense screening may be required. The intermediate-risk groups of these scores, with a relatively high 5-year cumulative IR, comprised about one third of patients, who may need to continue screening according to the current guidelines.
Also, these scores had relatively good Brier score (as an indication of overall performance) and non-significant Hosmer-Lemeshow test, indicating good calibration and that they had a good fit.
The good performance of the GES score and algorithm may be explained not only by the fact that they were derived from a similar population with the same HCV genotype but also by the inclusion of both F4 and F3 among their components [40,41]. It should be noted that the study cohort included patients with both F3 and F4, as did the cohorts from which the Watanabe et al. [34] and ADRES [38] scores were derived (FIB-4 > 3.25, i.e. F3 and F4).
There is debate about HCC surveillance after SVR in individuals with cirrhosis (F4) vs. pre-cirrhotic advanced fibrosis (F3). EASL supports ongoing surveillance in patients with advanced fibrosis (F3) whereas AASLD does not [7]. Ioannou, in his recent review of HCC surveillance after SVR in patients with F3 and F4, attempted to explain this important issue by reporting that the disagreement is due to the difficulty of precisely identifying patients with F3 fibrosis and the fact that they are a heterogeneous group, with some patients having F3-F4 fibrosis and higher HCC risk and others having F2-F3 fibrosis with lower risk. Furthermore, there is a possibility of misclassification of cirrhosis; certain patients are under-staged by biopsy or noninvasive markers of fibrosis and hence their risk of HCC is underestimated [11]. HCC that occurs in non-cirrhotic patients, such as those with advanced fibrosis (F3), is often diagnosed at late stages due to a low index of suspicion and lack of screening, which is a significant gap in clinical care. As a result, there is an unmet need for scores that can help to identify patients with F3 whose HCC risk is high enough to warrant HCC surveillance.
It is not clear why the Watanabe et al. (post-treatment) and ADRES scores had good statistical performance and clinical utility in our cohort although they were derived from totally different populations and genotypes. However, careful analysis of the components of these scores showed that they are very similar, including age, fibrosis stage and AFP. This similarity of the components of these HCC risk scores and its relation to clinical utility was recently highlighted by Voulgaris et al. [43], who stated that, in addition to the predictive performance, the components of each score and its formulas are crucial factors for the clinical utility of any risk score. Interestingly, what matters is not an AFP cut-off but its post-treatment changes and dynamics during follow-up. Although AFP is the most widely used biomarker in HCC surveillance, it is not included in the international guidelines because of its suboptimal sensitivity and specificity [43][44][45][46]. However, several recent reports confirmed that longitudinal AFP measurement, rather than an absolute cut-off value, may further increase the sensitivity of HCC detection [47,48]. Based on these reports and our findings, we suggest that a good score for HCC risk prediction should include the dynamics of AFP during follow-up. This suggestion may be reinforced by the observation that the scores of Watanabe et al. (pre-treatment), Fan et al. (aMAP score) and Sharma et al. (THRI score), which did not include AFP among their components, were not clinically useful in our patient cohort although they displayed the highest statistical performance among all studied scores.
The strength of this study is that it is the first direct comparison of recently published HCC risk prediction scores in CHC. This work was done in the same large cohort of CHC patients with a follow-up period of more than 5 years, allowing assessment of the statistical performance, applicability, clinical utility and potential cost-effectiveness of these scores. The results of this work may pave the way for related methodology together with objective and validated clinical interpretation of HCC risk scores.
The study has some limitations. First, it reflects the experience of a single center and patients were predominantly genotype 4, so validation in cohorts of different ethnicities and other HCV genotypes may be required before any further recommendations, because the HCV genotype has been shown to influence HCC risk in different populations: HCV genotype 3 is known to be associated with a higher incidence of HCC [49], and HCV genotype 6 increased the risk of hepatocellular carcinoma among Asian CHC patients with liver cirrhosis [50]. Second, the good performance of the GES score and algorithm may be explained by the fact that they were derived from a similar patient population.
In conclusion, many scores include parameters that are not readily available or have complicated formulas, which restricts their applicability in clinical practice. Statistical performance of HCC risk scores in CHC may not be directly transferable to clinical utility. The ADRES, GES score, GES algorithm and Watanabe et al. (post-treatment) scores seem to offer high predictability and clinical utility for HCC in CHC patients. The dynamics of AFP as a component of these scores may explain their good clinical usefulness when compared to other scores.