In this cross-sectional study of hepatitis C patients monitored by TE, we describe inter-rater agreement, which has not been previously reported. In our setting, 32% of patients had an inter-rater disagreement above our prespecified threshold. Furthermore, an almost twofold increase or decrease in kPa was required to represent a change in the underlying fibrosis with 95% certainty. In a post-hoc analysis, we found that longer fasting time before TE was associated with better inter-rater agreement.
The two TE operators assigned different F-scores in 35% of participants, compared with 23–35% in previous studies.7, 8, 11, 19 Although agreement metrics or SDC have not been presented before, a few studies provide Bland-Altman plots, from which LOA95 can be roughly estimated. A study by Fraquelli et al. showed LOA95 of bias ± approximately 4 kPa.9 Another study by Perazzo et al. found systematic bias between operators but also seems to display LOA95 of bias ± approximately 10 kPa.7 Our LOA95 on the original scale was bias ± 7.7 kPa, although heteroscedasticity (also suggested in previous studies) complicates the comparison. Thus, agreement in the current study seems to be on the lower end compared with previous studies.
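As a minimal sketch of how LOA95 values such as those compared above are typically obtained from paired readings, the following Python computes the bias (mean inter-operator difference) and bias ± 1.96 SD of the differences. The function name and the readings are hypothetical and are not data from this study. The calculation is on the original kPa scale and does not address the heteroscedasticity noted above.

```python
import statistics


def limits_of_agreement(op1, op2):
    """Return (bias, lower, upper) for paired readings in kPa.

    Standard Bland-Altman formulation: bias is the mean paired difference,
    and the 95% limits are bias +/- 1.96 * SD of the differences.
    """
    diffs = [a - b for a, b in zip(op1, op2)]
    bias = statistics.mean(diffs)
    half_width = 1.96 * statistics.stdev(diffs)  # sample SD (n - 1)
    return bias, bias - half_width, bias + half_width


# Hypothetical paired liver stiffness readings (kPa), for illustration only.
operator_a = [5.1, 7.4, 12.0, 9.8, 6.3]
operator_b = [5.6, 6.9, 14.5, 8.7, 6.0]

bias, lower, upper = limits_of_agreement(operator_a, operator_b)
print(f"bias {bias:.2f} kPa, LOA95 {lower:.2f} to {upper:.2f} kPa")
```

If the spread of differences grows with the magnitude of the readings, the same calculation can be applied to log-transformed values, which yields limits expressed as ratios rather than kPa.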
Inter-rater reliability, measured by ICC, was 0.86, which could be considered “good to excellent” but is also on the lower end of the range reported in previous TE studies (0.76 to 0.98).7–11, 20 The Kappa value of 0.64 for F4 vs F0-3 was lower than the 0.75 and 0.80 reported in previous studies, but categorised values could be more sensitive to measurement variation when they are close to cut-offs.7, 8 Reliability measures may also be lower in a more homogeneous study population. Our population had a lower median LS than in previous studies, with 15% having a unanimous F4 rating, compared with 18–36% in previous studies, possibly indicating a more homogeneous population with lower kPa values.7–9, 11, 19 Reclassification is less likely in the higher range, as all values above 12.5 kPa would be F4. To conclude, both larger variation and a more homogeneous population could have affected reliability negatively.
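To make the sensitivity of categorised values near a cut-off concrete, the sketch below dichotomises paired readings at 12.5 kPa (F4 vs F0-3) and computes Cohen's kappa from the resulting 2×2 agreement. The function name and the readings are hypothetical. This is the textbook unweighted kappa and is not necessarily the exact procedure used in this study.

```python
def cohen_kappa_f4(op1, op2, cutoff=12.5):
    """Unweighted Cohen's kappa for F4 (above cutoff) vs F0-3 ratings."""
    r1 = [x > cutoff for x in op1]
    r2 = [x > cutoff for x in op2]
    n = len(r1)
    observed = sum(a == b for a, b in zip(r1, r2)) / n
    p1 = sum(r1) / n  # proportion rated F4 by operator 1
    p2 = sum(r2) / n  # proportion rated F4 by operator 2
    expected = p1 * p2 + (1 - p1) * (1 - p2)  # chance agreement
    if expected == 1:
        raise ValueError("kappa is undefined when all ratings fall in one class")
    return (observed - expected) / (1 - expected)


# Hypothetical readings (kPa): the third pair straddles the 12.5 kPa cut-off.
operator_a = [5.0, 14.0, 20.0, 8.0]
operator_b = [6.0, 13.0, 11.0, 9.0]
print(f"kappa = {cohen_kappa_f4(operator_a, operator_b):.2f}")
```

A pair such as 12.0 and 13.0 kPa counts as a full disagreement here despite a difference of only 1 kPa, whereas 20.0 and 30.0 kPa count as agreement.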
In a post-hoc analysis, agreement was better when patients had fasted for ≥ 5 hours. Previous studies on fasting time have shown that liver stiffness returns to baseline 150 minutes after a meal.21 However, we found that variability seemed to be increased for up to 5 hours after food intake, although this was a secondary analysis. Previously, high BMI, liver biomarkers, and high IQR-% have also been associated with invalid TE measurements.22 We did not find an association between these factors and TE inter-rater variability. Findings related to operator experience have been conflicting.8, 22–24 Our results did not suggest systematic differences between operators, although statistical power was insufficient for formal analysis. In two participants, the two operators used different probes (medium and XL), resulting in large kPa differences, as seen in Table 3.
Reliability and agreement metrics were much better in intra-rater than inter-rater analysis, indicating that the change of operator introduced variability. In the intra-rater situation, the operator had more information, i.e., was not blinded to probe placement, probe angle, choice of probe, or the results from the previous exam. Even though the reliability metrics were excellent, the intra-rater SDC95 was 1.40, signalling a higher variability than the IQR-% for the 10 readings of each result. This may be explained by the fact that, in the intra-rater situation, the probe was removed and then replaced, in contrast to the 10 repeated measurements of each LS result. The protocol did not specify that the same probe location should be used, and previous studies suggest variability due to probe location.25
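For illustration, a ratio-scale SDC95 such as the 1.40 above can be translated into kPa bounds around a given reading. The sketch below assumes that the SDC95 is a ratio derived from log-transformed kPa values, using one common formulation (1.96 × √2 × SEM, which equals 1.96 × SD of the paired log differences). That derivation, the function names, and the 8.0 kPa baseline are assumptions made here and are not taken from the study's methods.

```python
import math
import statistics


def sdc95_ratio(first, second):
    """SDC95 as a ratio, from paired readings analysed on the log scale.

    Assumes SDC95 = 1.96 * sqrt(2) * SEM with SEM = SD_diff / sqrt(2),
    i.e. 1.96 * SD of the paired log differences, back-transformed.
    """
    log_diffs = [math.log(a) - math.log(b) for a, b in zip(first, second)]
    return math.exp(1.96 * statistics.stdev(log_diffs))


def change_bounds(baseline_kpa, ratio):
    """Readings outside (lower, upper) exceed the smallest detectable change."""
    return baseline_kpa / ratio, baseline_kpa * ratio


# Hypothetical baseline of 8.0 kPa with the intra-rater SDC95 of 1.40.
lower, upper = change_bounds(8.0, 1.40)
print(f"a repeat reading must fall below {lower:.1f} or above {upper:.1f} kPa")
```

Under these assumptions, the bounds are asymmetric in kPa but symmetric as ratios, which is why a single SDC95 value can describe both an increase and a decrease.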
This study has several limitations. We used many operators, who did not have extensive experience by international comparison. However, all operators were certified and perform TE in routine clinical practice. The sample size was relatively small, making it sensitive to outliers, as elaborated in the supplementary appendix. In the inter-rater analysis, there was information bias, as marks from the preceding procedure could be visible, which could decrease variability if the same probe location was chosen. On the other hand, the operator was blinded to the previous choice of probe (which would normally be documented), and in two cases the use of different probes resulted in high variability.
This is a small study, and our results need verification in other settings. As TE is increasingly being used in other diagnoses, studies in these should be emphasised as well, including in hepatitis B. In such studies, reporting both reliability and agreement should be encouraged, employing the Guidelines for Reporting Reliability and Agreement Studies (GRRAS).26 From a clinical perspective, our finding that eight patients (12%) received different F4 ratings is important, as this would entail different allocation to screening for hepatocellular cancer. This suggests that values close to clinically important cut-offs should be scrutinised, and that repeated TE and/or liver biopsy should be considered in selected patients. Furthermore, the wide range of SDC95 suggests that it may be difficult to determine whether changing TE results in longitudinal monitoring represent measurement error or progressive fibrosis. Lastly, our post-hoc analysis emphasises the importance of fasting before TE and suggests that clinicians may consider postponing TE in the case of insufficient fasting.