An unbiased prediction modeling approach can identify known associations as well as propose potential new ones. By analyzing large collections of EMRs from a population with a high probability of NAFLD, we were able to evaluate associations between laboratory observations, comorbidities, and behavioral covariates. We took an unbiased, data-driven approach to feature selection, in which a large collection of covariates was considered as candidates for selection without the need for a human domain expert. We used all comorbidity covariates defined by the Agency for Healthcare Research and Quality (i.e., CCS codes). The covariates also included a comprehensive collection of common laboratory covariates as well as traditional factors such as age, gender, and ethnicity. Combining unbiased feature selection with a stringent variable-filtering process increased confidence in the results. Our methodology could be used for fast screening of a variety of covariates, risk factors, and their associations, and it could further serve as a hypothesis generator for additional experiments carried out by other researchers.
Our methodology identified several thousand strongly statistically significant associations between covariates and outcomes in NAFLD. Most of the associations are known, but many others may be new and require further investigation in subsequent studies. Because we found so many interesting correlations, it was not straightforward to decide which findings to describe within the scope of a short scientific paper. We therefore highlight only a few findings.
An increased creatinine level, for instance, has been widely reported as an indicator of reduced renal function in the general population [12,13]. In NAFLD, increased creatinine and reduced eGFR have been studied extensively for their links to kidney malfunction [2,14]. It is also well known that NAFLD is associated with high levels of HbA1c and a high prevalence of T2DM [15]. Consistent with this, increased levels of HbA1c were associated with T2DM complications (e.g., retinal defects, nephritis, glaucoma). Our findings that T2DM complications (e.g., hyperglycemia or diabetic ketoacidosis) were strongly associated with HbA1c seem reasonable given the broad literature; however, these complications have not yet been sufficiently explored in NAFLD patients.
The association between increased HbA1c and cardiovascular conditions is well known [16], and our methodology identified it as well (e.g., Acute MI in Fig. 4a). We also identified a negative correlation between HbA1c and ulcerative colitis, which has not been captured in the literature. More interestingly, increased levels of HbA1c were negatively correlated with prostate cancer [17], but not with a broad range of other types of cancer. A potential inverse association between developing cancers and increased HbA1c, increased BMI, or comorbid conditions related to metabolic syndrome has already been reported [18–20], but further studies are required to assess such associations more precisely. Also notable was the negative correlation between increased creatinine and diseases of the female genital organs (e.g., prolapse and menopausal disorders). Such an association has not yet been reported and therefore requires further investigation.
As further reassurance that our methodology can identify real-life connections, increased hemoglobin was correlated with decreased fatigue. This finding is biologically plausible because hemoglobin is the oxygen-carrying molecule of the body, so an increase in it would be expected to decrease fatigue. This association is well known in studies focused on specific populations [21] but not in NAFLD. Increased hemoglobin was also associated with benign prostatic hyperplasia and non-epithelial cancers. These findings are an example of a topic for further research because the link between them is not well studied. Additional findings identified by our methodology were consistent with the literature, such as those for calcium [22] and chloride [23].
SLE has been associated with the development of cervical cancer, and our results were similar [11]. Our results suggest that cervical cancer may precede the development of SLE, contrary to proposals that SLE predisposes a patient to cervical cancer. Autoimmune diseases are well known to have environmental factors that increase the likelihood of disease development. It is possible that an infection with human papillomavirus, one of the well-known risk factors for cervical cancer development, is an environmental trigger [24] that increases the likelihood of developing SLE. Further characterization of this correlation could reveal valuable insight into disease pathogenesis.
Beyond the known cardiovascular-related comorbidities and smoking associated with the development of MI, our methodology identified an injury related to a vehicle, train, or motorcycle accident, defined by CCS codes as “Motor Vehicle Traffic (MVT),” as a potentially powerful predictor (OR, 1.92; 95% CI, 1.31–2.74). Although this association has already been proposed [25], it has mostly been reported in the lay press and not in relation to NAFLD. Notably, this covariate was not associated with other cardiovascular outcomes in NAFLD. In clinical practice, acute MI is rare after MVT crashes; it is more likely that a cardiac contusion raises cardiac enzymes and is then miscoded as acute MI. Our observation should therefore be interpreted with caution because it may simply be a coding artifact rather than a clinically meaningful finding.
Although a variety of covariates such as age and BMI were associated with the development of varicose veins [26], a surprising covariate was a preceding Parkinson’s disease diagnosis. This finding has not yet been reported, and it suggests several hypotheses to explore. For example, a subset of Parkinson’s patients may be at increased risk for venous insufficiency. Alternatively, a subset of patients may experience venous insufficiency as an adverse reaction to medication. It is even possible that only patients with both NAFLD and Parkinson’s disease experience this feature. The comorbid conditions may also share a common genetic underpinning. Our methodology can generate such hypotheses relating covariates to NAFLD outcomes, which can then be used for genetic studies.
As expected, being a woman with no history of osteoporosis was strongly associated with the development of osteoporosis. Another known association was that a history of multiple myeloma was associated with the development of osteoporosis [27]. Interestingly, established sepsis or multiple sclerosis was also associated with the development of osteoporosis. Other known associations, such as the fact that transient ischemic attacks were positively associated with the development of traditional cardiovascular-related risk factors but negatively correlated with albumin, increase confidence in the accuracy of our methodology.
Although our methodology identified associations between smoking and the development of a variety of diseases, it also found that smoking appeared protective against glaucoma and cataracts. These findings are surprising, though at least one study has reported similar results regarding glaucoma and the potential protective effect of nicotine [28]. Determining a patient’s smoking status (past, present, none) accurately is challenging. We extracted the statuses from the social history table; however, the prevalence of current smoking in our study may have been underestimated. Smoking status may also be stored in additional resources such as Systematized Nomenclature of Medicine diagnosis codes. Clinical narrative notes serve as a further primary source for documenting patient smoking status, which presents additional challenges for extracting the statuses accurately [29,30].
In the univariate analysis, we found that osteoporosis was positively correlated with the development of a urinary tract infection (UTI) (2.6% vs. 1.0%; P < 0.000001) among patients with NAFLD. In the multivariate regression, however, the direction of the association reversed (OR, 0.84; 95% CI, 0.71–0.98). This reversal suggests that interpreting the results of our methodology depends on the clinical context and should therefore be based on general clinical knowledge as well as prior research results. A similar example is the negative association observed between prostate cancer and coronary atherosclerosis. Several publications have reported results that seem to confirm this reduced risk. Although this association may be only minor and of limited statistical significance [31,32], it may have more of an impact in the NAFLD population.
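The change in direction between a univariate comparison and an adjusted estimate can be illustrated with a minimal sketch. The counts and the two strata below are hypothetical and invented purely for illustration; they are not the study's data. The sketch also uses a Mantel-Haenszel stratified odds ratio as a simple stand-in for adjustment, not the multivariate regression used in the study. It shows how a crude odds ratio above 1 can coexist with an adjusted odds ratio below 1 when a confounder (here, an assumed baseline-risk stratum) is unevenly distributed between exposed and unexposed patients.

```python
# Hypothetical 2x2 counts per stratum, invented for illustration only.
# Each tuple: (exposed with outcome, exposed without outcome,
#              unexposed with outcome, unexposed without outcome)
strata = {
    "higher-baseline-risk stratum": (27, 873, 35, 965),
    "lower-baseline-risk stratum": (1, 199, 63, 8937),
}


def crude_odds_ratio(tables):
    """Odds ratio after pooling all strata into a single 2x2 table."""
    a = sum(t[0] for t in tables)
    b = sum(t[1] for t in tables)
    c = sum(t[2] for t in tables)
    d = sum(t[3] for t in tables)
    return (a * d) / (b * c)


def mantel_haenszel_odds_ratio(tables):
    """Mantel-Haenszel odds ratio, which adjusts for the stratifying variable."""
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return numerator / denominator


tables = list(strata.values())
crude_or = crude_odds_ratio(tables)
adjusted_or = mantel_haenszel_odds_ratio(tables)

print(f"Crude OR:    {crude_or:.2f}")     # above 1 for these counts
print(f"Adjusted OR: {adjusted_or:.2f}")  # below 1 for these counts
```

In this sketch the exposed patients are concentrated in the higher-baseline-risk stratum, so pooling inflates the apparent risk even though the odds ratio is below 1 within each stratum. This is one possible mechanism for such a reversal, not a claim about what drove it in the study.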