External Validation and Clinical Utility of Non-Invasive Prediction Models for Non-Alcoholic Fatty Liver Disease in Malaysia

Background: Many prediction models have been developed to detect non-alcoholic fatty liver disease (NAFLD). A drawback of many of these models is their reliance on parameters that are not routinely measured locally. This study aimed to evaluate the external validity of a series of prediction models for NAFLD, which were selected because they use clinical parameters that are routinely measured and tested in public healthcare centers in Malaysia. Methods: A literature search was conducted for articles published between 2000 and 2019 that described prediction models for NAFLD in adult subjects. The validation cohort comprised patients who underwent liver elastography using the Fibroscan® device in a public tertiary care center between January 2017 and December 2019. Both the discrimination and calibration of each model were assessed to determine its predictive performance. Results: Of the 404 patients who underwent liver elastography, 280 (69.3%) were diagnosed with NAFLD. Six prediction models were identified from the existing literature and evaluated. The calibration assessment demonstrated that, although three of the models overestimated the NAFLD risk, updating the models generally improved their calibration performance. The discriminative performance of the selected models, measured as the area under the receiver operating characteristic curve, ranged from 0.717 to 0.783. At specificity levels of 90% and 80%, the sensitivity of the models ranged from 31.1% to 48.9% and from 46.4% to 66.8%, respectively. The Framingham Steatosis Index (FSI) model demonstrated better predictive performance than the other models. Conclusions: The FSI model demonstrates an acceptable predictive performance. Its application in clinical practice could promote the screening and early treatment of NAFLD in the Malaysian population.

To address the limitations of the existing measures, many non-invasive prediction models have been developed to guide the risk assessment of NAFLD. The parameters used in these models range from demographic characteristics, anthropometric measurements, and laboratory findings to more specific biomarkers, such as sphingolipids and sterols. The drawbacks of many of these models are their use of parameters that are not routinely measured and tested and the high costs involved, which limit their feasibility in healthcare settings.
In Malaysia, the prevalence of NAFLD is high, estimated to fall between 37.4% and 46.0% [11][12][13]. The disease is more common in those who have diabetes (49.6%) [14], hypercholesterolaemia (56.7%) [15], and metabolic syndrome (82.8%) [16]. However, a national screening program for NAFLD is still not in place. Uncertainty about the usefulness of the screening tools and the high cost of the screening tests commonly preclude NAFLD screening in clinical practice [17]. Furthermore, imaging tests for the confirmatory diagnosis of NAFLD are only available in tertiary care centers, which generally have a high patient load and a long waiting time. These limitations often result in missed opportunities to detect early-stage NAFLD, which can be reversed with improved dietary habits and lifestyle changes. Therefore, a reliable and handy prediction model for NAFLD would be beneficial. This study aims to validate a range of prediction models for NAFLD, particularly those that apply parameters that are routinely measured and tested in public healthcare centers in Malaysia.

Methods
This study received approval from the Medical Research and Ethics Committee under the Ministry of Health Malaysia (NMRR-20-748-54587). The requirement for informed consent was waived because the data were retrospectively accessed. The study's methods and findings were reported in line with the guidelines on the transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD) [18].
An article was only selected if it did the following: (i) presented the development of a prediction model or an update of a previously developed model for NAFLD, (ii) used the risk of developing NAFLD in the general population as the study's endpoint, (iii) applied multiple parameters or predictors to the model, (iv) developed the model based on weighted risk predictors, (v) provided the full model's linear predictor or prediction algorithm, and (vi) only applied parameters that were routinely measured and tested in public healthcare centers in Malaysia. When a full-text article was not publicly available, we made up to two attempts to contact the corresponding authors by email. The reference lists of the selected articles were also used to identify additional relevant articles.

Validation cohort
The validation cohort comprised patients seeking care from Hospital Sultanah Bahiyah, a public tertiary care center, which also served as the gastroenterology referral center in northern Malaysia. They were all above 18 years of age and underwent liver elastography using the Fibroscan® device (EchoSens, Paris) between January 2017 and December 2019. Those who had a history of active alcohol consumption (more than 14 drinks per week for men or more than seven for women), viral hepatitis, autoimmune hepatitis, or other forms of chronic liver disease were excluded.
The information on the risk factors or predictors that were used in each prediction model was obtained from the patients' electronic medical records. The predictors included individual socio-demographic and clinical information, ranging from age, ethnicity, gender, education level, marital status, occupation, and body mass index (BMI) to the presence of cardiovascular diseases (diabetes mellitus, hypertension, dyslipidaemia, and coronary artery disease) and the laboratory findings, including alanine aminotransferase (ALT), aspartate aminotransferase (AST), fasting blood glucose level, triglycerides (TG), and serum cholesterol.
The diagnosis of NAFLD was confirmed by physicians based on the findings of the liver elastography. The controlled attenuation parameter (CAP) was used to measure the level of hepatic steatosis; a reading above 248 decibels/meter (dB/m) indicated NAFLD [19]. Ten measurements were performed for each patient, and the diagnosis was only confirmed if at least six readings were valid. For the purpose of this study, only the CAP results from the first liver elastography, as well as the information and laboratory findings from the patient's clinic visits prior to the first liver elastography, were gathered.
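The decision rule above can be sketched in a few lines of Python. This is an illustration only, not the procedure used by the physicians or by the device: the function name is hypothetical, invalid readings are represented as `None`, and summarizing the valid readings by their median is an assumption that the text does not state.

```python
from statistics import median
from typing import Optional, Sequence

CAP_THRESHOLD_DB_PER_M = 248.0  # threshold quoted in the text [19]
MIN_VALID_READINGS = 6          # required out of the ten measurements per patient


def classify_nafld(readings: Sequence[Optional[float]]) -> Optional[bool]:
    """Return True/False for NAFLD, or None when too few readings are valid.

    `readings` holds the CAP values in dB/m; invalid readings are None.
    Using the median of the valid readings is an assumption for illustration.
    """
    valid = [r for r in readings if r is not None]
    if len(valid) < MIN_VALID_READINGS:
        return None  # diagnosis cannot be confirmed
    return median(valid) > CAP_THRESHOLD_DB_PER_M


# Hypothetical patients
confirmed = classify_nafld([300.0] * 7 + [None] * 3)
not_nafld = classify_nafld([200.0] * 10)
unconfirmed = classify_nafld([300.0] * 5 + [None] * 5)
```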

Statistical analysis
Generally, studies on the external validation of prediction models require at least 100 (or ideally more than 200) events to generate an adequate study sample size [20,21]. To handle the incomplete information in the validation cohort, the predictive mean matching method was applied to generate five imputed datasets, the results of which were then pooled using Rubin's rules [22]. The demographic and clinical characteristics of the patients were summarized as either percentages (categorical data) or means and standard deviations (numerical data).
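As a rough illustration of the pooling step, the sketch below applies Rubin's rules to one estimate obtained from each imputed dataset. The study itself was run in R, so this Python function (`pool_rubin`, a hypothetical name) and the example numbers are assumptions for illustration rather than the authors' code.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence, Tuple


def pool_rubin(estimates: Sequence[float],
               variances: Sequence[float]) -> Tuple[float, float]:
    """Pool per-imputation estimates with Rubin's rules.

    estimates: one point estimate per imputed dataset
    variances: the squared standard error of each estimate
    Returns the pooled estimate and its pooled standard error.
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need matching estimates and variances from >= 2 imputations")
    pooled_estimate = mean(estimates)            # average of the m estimates
    within_var = mean(variances)                 # average within-imputation variance
    between_var = variance(estimates)            # sample variance (divides by m - 1)
    total_var = within_var + (1 + 1 / m) * between_var
    return pooled_estimate, sqrt(total_var)


# Hypothetical estimates from three imputed datasets
pooled, pooled_se = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

The between-imputation term is what makes the pooled standard error larger than the average within-imputation standard error, reflecting the uncertainty added by the missing values.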
For each patient in the validation cohort, the risk of NAFLD was calculated using the algorithms provided by the selected prediction models. The predictive performance of each model was estimated using discrimination (the ability of a model to differentiate between individuals with and without NAFLD) and calibration (the agreement between the predictions and observed outcomes). The model's calibration was assessed graphically using a calibration plot. A perfect model prediction was expected to be represented by a 45° line with an intercept (α) of zero and a slope (β) of one in the calibration plot [23]. The calibration intercept quantified the degree of agreement between the proportion of observed NAFLD cases and the mean predicted probability, which would indicate whether the predictions were systematically too low or too high [23]. On the other hand, the calibration slope referred to the degree of agreement between the predicted probability of developing NAFLD in the present study and the actual probability of having NAFLD [24]. The graph for each model was plotted based on the results of ten groups of a similar number of patients from the validation cohort who had similar predicted probabilities [24].
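The grouping behind such a calibration plot can be sketched as follows. This is a minimal Python illustration under stated assumptions (the study used R; the function name and the example data are hypothetical): patients are sorted by predicted probability, split into ten groups of similar size, and each group contributes one point, namely its mean predicted probability and its observed proportion of NAFLD.

```python
from typing import List, Sequence, Tuple


def calibration_groups(predicted: Sequence[float],
                       observed: Sequence[int],
                       n_groups: int = 10) -> List[Tuple[float, float]]:
    """Return (mean predicted risk, observed event proportion) per risk group."""
    pairs = sorted(zip(predicted, observed))  # order patients by predicted risk
    n = len(pairs)
    points = []
    for g in range(n_groups):
        chunk = pairs[g * n // n_groups:(g + 1) * n // n_groups]
        if not chunk:
            continue  # fewer patients than groups
        mean_predicted = sum(p for p, _ in chunk) / len(chunk)
        observed_rate = sum(y for _, y in chunk) / len(chunk)
        points.append((mean_predicted, observed_rate))
    return points


# Hypothetical cohort of 20 patients
example_predicted = [i / 20 for i in range(20)]
example_observed = [0] * 10 + [1] * 10
points = calibration_groups(example_predicted, example_observed)
```

Points lying on the 45° line indicate agreement between predicted and observed risk; points below the line indicate overestimation and points above it indicate underestimation.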
Direct application of the published models on the current validation cohort might have caused miscalibration, which is characterized by deviations from the ideal line (i.e., a calibration-in-the-large that differs from zero or a calibration slope that differs from one). In the case of model miscalibration, the prediction model was updated by calculating a correction factor using the equation described in [25]. The correction factor was then added to the original model's intercept, and the new intercept was used when the updated model was applied to the validation cohort [25]. This method improved the model's calibration without affecting its ability to discriminate between individuals with and without NAFLD [26]. Furthermore, the model's discrimination was assessed based on the concordance ('c') statistic, which was equal to the area under the receiver operating characteristic (ROC) curve, along with its corresponding 95% confidence interval. An area under the ROC curve greater than 0.5 indicated that the model discriminated better than chance and could be used to predict NAFLD [27].
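The correction-factor equation is not reproduced in this text. The sketch below therefore assumes one commonly used form of intercept update, the natural logarithm of the observed odds divided by the mean predicted odds; whether this is exactly the equation from [25] is an assumption. The function names, linear predictors, and outcomes are hypothetical. The example also illustrates why adding a constant to the intercept leaves the c-statistic unchanged: the ranking of the patients does not change.

```python
from math import exp, log
from typing import Sequence


def expit(x: float) -> float:
    """Inverse logit: convert a linear predictor into a probability."""
    return 1.0 / (1.0 + exp(-x))


def correction_factor(observed_prevalence: float, mean_predicted_risk: float) -> float:
    """ln(observed odds / mean predicted odds); an assumed form of the update."""
    observed_odds = observed_prevalence / (1 - observed_prevalence)
    predicted_odds = mean_predicted_risk / (1 - mean_predicted_risk)
    return log(observed_odds / predicted_odds)


def c_statistic(risks: Sequence[float], outcomes: Sequence[int]) -> float:
    """Share of (case, non-case) pairs in which the case has the higher risk.

    Ties count as half; this equals the area under the ROC curve.
    """
    cases = [r for r, y in zip(risks, outcomes) if y == 1]
    non_cases = [r for r, y in zip(risks, outcomes) if y == 0]
    wins = sum((c > n) + 0.5 * (c == n) for c in cases for n in non_cases)
    return wins / (len(cases) * len(non_cases))


# Hypothetical linear predictors from an original model and observed outcomes
linear_predictors = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.5]
outcomes = [0, 1, 0, 1, 1, 1]

original_risks = [expit(lp) for lp in linear_predictors]
cf = correction_factor(sum(outcomes) / len(outcomes),
                       sum(original_risks) / len(original_risks))
# Updated model: same coefficients, intercept shifted by the correction factor
updated_risks = [expit(lp + cf) for lp in linear_predictors]

auc_original = c_statistic(original_risks, outcomes)
auc_updated = c_statistic(updated_risks, outcomes)
```

In this toy example the observed prevalence exceeds the mean predicted risk, so the correction factor is positive and every predicted risk is shifted upwards.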
Subsequently, the diagnostic accuracy of each updated prediction model was examined using the sensitivity, specificity, positive and negative likelihood ratios, and positive and negative predictive values. These diagnostic parameters were first calculated at a cut-off value chosen so that ten percent of the population had predicted values above it. The procedure was then repeated using cut-offs at which 20%, 80%, and 90% of the population had values above the cut-off. All the data in this study were analyzed using the R statistical software version 3.5.2 (rms, Hmisc, pROC and rmda packages) [28].
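The percentile-based cut-offs and the resulting accuracy measures can be sketched as below. This is an illustrative Python version under stated assumptions, not the R code used in the study: it takes the Methods wording literally (the fraction refers to the whole cohort), treats a predicted risk at or above the cut-off as a positive test, and uses hypothetical function names and data.

```python
from typing import Dict, Optional, Sequence


def cutoff_for_top_fraction(risks: Sequence[float], fraction: float) -> float:
    """Lowest predicted risk among the top `fraction` of the cohort."""
    ranked = sorted(risks, reverse=True)
    k = max(1, round(fraction * len(ranked)))
    return ranked[k - 1]


def _ratio(numerator: float, denominator: float) -> Optional[float]:
    """Divide, returning None when the denominator is zero."""
    return numerator / denominator if denominator else None


def diagnostic_accuracy(risks: Sequence[float], outcomes: Sequence[int],
                        cutoff: float) -> Dict[str, Optional[float]]:
    """Accuracy measures when a risk at or above the cut-off counts as positive."""
    tp = sum(1 for r, y in zip(risks, outcomes) if r >= cutoff and y == 1)
    fp = sum(1 for r, y in zip(risks, outcomes) if r >= cutoff and y == 0)
    fn = sum(1 for r, y in zip(risks, outcomes) if r < cutoff and y == 1)
    tn = sum(1 for r, y in zip(risks, outcomes) if r < cutoff and y == 0)
    sens = _ratio(tp, tp + fn)
    spec = _ratio(tn, tn + fp)
    have_both = sens is not None and spec is not None
    return {
        "sensitivity": sens,
        "specificity": spec,
        "ppv": _ratio(tp, tp + fp),
        "npv": _ratio(tn, tn + fn),
        "lr_positive": _ratio(sens, 1 - spec) if have_both else None,
        "lr_negative": _ratio(1 - sens, spec) if have_both else None,
    }


# Hypothetical cohort of ten patients, targeting the 20% at highest risk
example_risks = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 0.95]
example_outcomes = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]
cutoff = cutoff_for_top_fraction(example_risks, 0.20)
result = diagnostic_accuracy(example_risks, example_outcomes, cutoff)
```

The positive likelihood ratio is undefined when the specificity is exactly one, which is why the helper returns `None` instead of dividing by zero.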

Prediction model selection
The search yielded 5985 articles from PubMed, 5899 of which were excluded based on their titles or abstracts. A total of 86 articles fulfilled the inclusion criteria, including two that were identified from the reference lists of the initially selected articles (Figure 1). The six prediction models that were selected for further assessment were the Hepatic Steatosis Index (HSI) by Lee et al. [29], the Fatty Liver Disease Index (FLDI) by Fuyan et al. [30], the ZJU Index (ZJUI) by Wang et al. [31], the Framingham Steatosis Index (FSI) by Long et al. [32], the NAFLD Ridge Score by Yip et al. [33], and the NAFLD Scoring System by Lesmana et al. [34] (Table 1). The models were developed and published between 2010 and 2017. Two models were from China, and there was one each from Korea, Indonesia, the United States (US), and Hong Kong. The risk algorithms of the selected models are presented in Table 2.

Validation cohort
A total of 404 individuals underwent liver elastography over the three-year study period. More than two-thirds of them (69.3%) were diagnosed with NAFLD. The most common unavailable or undocumented information included the HbA1c level (>15%), white blood cell count (>15%), BMI (10.9%), and AST level (1.7%). NAFLD was found to be more common among those who were older, female, of Malay ethnicity, and had a higher BMI and a larger waist circumference. They were also more likely to be diagnosed with diabetes mellitus (20.7%), hypertension (32.5%), and dyslipidaemia (34.6%). Their characteristics are summarized in Table 3.

Predictive performances
As demonstrated in the calibration plots (Figure 2), the predicted risks were closely clustered around the means in all the models. The FLDI, HSI, and ZJUI were found to have overestimated the NAFLD risk (intercept <0). All the models were overfitted (calibration slope <1) except for the NAFLD Ridge Score. The FSI was found to be the best fit for the validation cohort (calibration slope = 0.84). Updating the models generally improved their calibration performance, with a calibration line that was much closer to the ideal line (Figure 3). Nevertheless, the FLDI and NAFLD Ridge Score did not show much improvement following the model update. Figure 4 shows the AUROC for all the original prediction models. All models had fair discrimination, with an AUROC above 0.7. The NAFLD Ridge Score yielded the lowest AUROC (0.717; 95% CI 0.662-0.772). Three models (the FSI, the FLDI, and the ZJUI) demonstrated the best discriminative ability, with AUROCs of 0.783, 0.782, and 0.781, respectively. As updating the models did not change the ranking of the predicted risks of the patients, the AUROC of the models remained the same.
The sensitivity, specificity, likelihood ratios, and positive and negative predictive values of the updated prediction models are shown in Table 4. By targeting the ten percent of the population that had values above the cut-offs, the HSI, FSI, and ZJUI were able to identify between 46% and 49% of the studied population who potentially had NAFLD. At 20% above the cut-off point, which is equal to a specificity of 80%, the sensitivity increased to 61%-67%. The FSI model was able to identify 47.5% and 66.8% of the individuals who had NAFLD by targeting the ten percent and 20% at the highest risk, respectively. Nevertheless, low specificity was seen in all prediction models when the sensitivity was above 90%. The NAFLD Ridge Score performed the worst, with low sensitivity across all selected cut-offs.

Discussion
To identify a prediction model that could be beneficial for identifying NAFLD in Malaysia, this study evaluated the performance of six existing prediction models that apply laboratory parameters that are routinely measured and tested in the country. The findings suggest that the FSI model by Long et al. [32], which was first developed based on the US population, has better discrimination and calibration performance than the other models in predicting NAFLD in the Malaysian population.
The FSI model is simple and practical. It incorporates seven parameters, which are commonly tested and documented not only in hospitals but also in public healthcare centers across the country. This feature enables the early detection and management of NAFLD. The finding that the three models with a high AUROC value (FSI, FLDI and ZJUI) used similar predictors points to a strong association of BMI, triglycerides, the ALT/AST ratio, and an abnormal glucose level with the risk of developing NAFLD. Additionally, the FSI also considers the impact of age, sex, and hypertension on the development of NAFLD in the model's algorithm. Many previous studies found that NAFLD was more common in the elderly and males [35,36] and that hypertension is an independent risk factor for NAFLD [37]. This could explain why the FSI model performs better when discriminating between individuals with and without NAFLD.
The first application of the original FSI model on the validation cohort demonstrated insufficient calibration. The predicted probabilities that were generated by this model were systematically too low, with a calibration intercept of 2.05. Shen et al. reported a similar observation [38]. The systematic underestimation of the NAFLD risk in this study could partly be explained by the difference in the frequency of NAFLD between the development cohort (317/1181, 26.8%) [32] and the validation cohort (280/404, 69.3%). To correct the miscalibration, a correction factor was added to the intercept of the original model. By combining the information captured in the original model with a correction factor that was calculated from the current patient sample, the updated models were adjusted to the local population and therefore yielded better calibration. Consequently, the calibration performance of the updated FSI model improved, with a new intercept that was closer to zero and a calibration line that was closer to the ideal line.
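As a rough check on this explanation, the gap between the two cohorts can be expressed on the log-odds scale, using only the proportions quoted above. This back-of-the-envelope calculation is an illustration added here, not an analysis reported by the study:

```latex
\ln\!\left(\frac{0.693/(1-0.693)}{0.268/(1-0.268)}\right)
  \approx \ln\!\left(\frac{2.257}{0.366}\right)
  \approx \ln(6.17)
  \approx 1.82
```

The result has the same sign and a similar size to the reported calibration intercept of 2.05. The two values need not coincide, because the calibration intercept is estimated from the individual predictions and so also reflects differences in case mix between the cohorts.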
Given the good predictive performance demonstrated by the updated FSI model in the current study, it is recommended as a simple screening tool for determining the risk of NAFLD in the local population. If the model is used in clinical practice, an empirical impact study on its value in improving patient care, reducing the burden on the healthcare system, and enhancing patient satisfaction is warranted.
The main strength of this study is that it is the first attempt to externally validate non-invasive prediction models for NAFLD and compare their performance using a Malaysian population. Incomplete information in the medical records of the validation cohort was minimal and was imputed to reduce the risk of biased results. The strength of this study also lies in the use of the CAP of the Fibroscan® for the diagnosis of NAFLD. The CAP has been shown to perform well and to agree strongly with the findings of liver biopsies on steatosis [39,40]. A limitation of this study is that it only searched and examined prediction models that were found in articles from the PubMed database. Nevertheless, this database was chosen because it contains a large number of articles that are written in English and covers all major journals across multiple clinical disciplines.

Conclusion
In conclusion, the updated FSI is favorable compared to other similar models to predict NAFLD in the Malaysian population. The application of routinely tested and documented parameters in this model makes it easy to use and practical.