A practical research method integrating machine learning with conventional statistics has been sought in medicine. Although glomerular hypertrophy (or a large renal corpuscle) on renal biopsy has pathophysiological implications, it is often misdiagnosed as adaptive/compensatory hypertrophy. We explored the factors associated with a maximal glomerular diameter of ≥242.3 μm using machine learning.
We defined machine learning scores from symbolic regression models, based on a frequency-of-usage ranking of features in the predictive models. We then compared the features selected by genetic programming with those selected by the point-biserial correlation coefficient, using multivariable logistic and linear regression to evaluate discriminatory ability, goodness of fit, and collinearity.
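The two feature-ranking approaches can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the procedure used in this study: the function names and the toy feature names are hypothetical, and the usage score shown here (the fraction of models in which a feature appears) is only one plausible reading of a frequency-of-usage ranking.

```python
import math
from collections import Counter


def point_biserial(binary, values):
    """Point-biserial r between a 0/1 outcome and a continuous feature."""
    n = len(values)
    group1 = [v for b, v in zip(binary, values) if b == 1]
    group0 = [v for b, v in zip(binary, values) if b == 0]
    n1, n0 = len(group1), len(group0)
    mean = sum(values) / n
    # Population standard deviation of the continuous feature.
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    mean_diff = sum(group1) / n1 - sum(group0) / n0
    return mean_diff / sd * math.sqrt(n1 * n0 / n**2)


def usage_scores(models):
    """Assumed score: fraction of models in which each feature appears."""
    # Each model is represented as the list of feature names it uses;
    # a feature is counted once per model even if it occurs repeatedly.
    counts = Counter(f for model in models for f in set(model))
    return {f: c / len(models) for f, c in counts.items()}


def rank_features(scores):
    """Feature names ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)
```

For example, `rank_features(usage_scores(models))` would place a feature used by every model first, whereas ranking by the absolute value of `point_biserial` orders features by their univariate association with the binary outcome.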
Body mass index, complement component C3, serum total protein, arteriolosclerosis, C-reactive protein, and the Oxford E1 score were among the 10 features with the highest machine learning scores obtained by genetic programming. The estimated glomerular filtration rate was ranked 46th among the 60 features. In multivariable analyses, the model using features selected by genetic programming had a higher R² value (0.61 vs. 0.45) and a lower corrected Akaike Information Criterion value (402.7 vs. 417.2) than the model using features selected by point-biserial r. Two features had variance inflation factors higher than 5 in the model using point-biserial r, whereas none did in the machine learning model.
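For readers unfamiliar with the two diagnostics reported above, the following Python sketch shows their standard textbook definitions. The function and parameter names are hypothetical, and the sketch does not reproduce the study's models or data.

```python
def aicc(log_likelihood, k, n):
    """Corrected Akaike Information Criterion (small-sample correction).

    k is the number of estimated parameters and n the number of
    observations; the correction term requires n > k + 1.
    """
    aic = 2 * k - 2 * log_likelihood
    return aic + (2 * k * (k + 1)) / (n - k - 1)


def vif(r_squared):
    """Variance inflation factor of one predictor.

    r_squared is the R² obtained by regressing that predictor on all
    other predictors in the model.
    """
    return 1.0 / (1.0 - r_squared)
```

Under these definitions, a lower corrected Akaike Information Criterion value indicates a better trade-off between fit and model complexity, and a variance inflation factor above 5 corresponds to a predictor for which more than 80% of the variance is explained by the other predictors.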
Machine learning may be useful for identifying both significantly and insignificantly correlated factors. Because using top-ranked features selected by machine learning is procedurally simple, our method may be generalizable to other medical research.