We combined the strengths of ML with research based on traditional statistical methods and obtained two main findings. First, feature selection by ML yielded higher R2, pseudo-R2, and C-statistic values and lower AIC values than feature selection by point-biserial r, suggesting that ML techniques can handle large numbers of variables and are useful for elucidating novel factors related to multifactorial diseases. Second, the weak association between MaxGD and eGFR indicates that MaxGD represents morbid GH rather than compensatory GH.
Traditional statistics based on a priori knowledge and a hypothesized model may delay clinical research progress because each study addresses only a few novel prognostic features 3. In contrast, ML can reduce the dimensionality of variables 3,4 owing to its intrinsic ability to build a predictive model from the data. Performing ML first allows us to account for potential linear and non-linear predictors without unconscious bias because no a priori choice among potential predictors is required. Nonetheless, ML methods do not perform statistical inference and do not offer valid inferences on feature importance; therefore, statistical testing should follow. Given that the advantages of ML techniques are flexibility and the lack of a priori assumptions, whereas those of traditional statistical approaches are simplicity and transparency 3,4, we first narrowed down the number of features by ML and subsequently validated the selected features using traditional statistical measures, such as R2, pseudo-R2, AIC, and C-statistic. To mitigate overfitting 11 and generalize the ML results, we focused on the ability of ML to select features rather than on its predictive models.
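The validation measures named above can be illustrated with a minimal sketch. It assumes a binary outcome and McFadden's definition of pseudo-R2 (the text does not say which pseudo-R2 was used); the outcomes and predicted probabilities are hypothetical and only stand in for two candidate feature sets.

```python
import math

def log_likelihood(y, p):
    """Bernoulli log-likelihood of binary outcomes y given predicted probabilities p."""
    return sum(math.log(pi) if yi == 1 else math.log(1.0 - pi) for yi, pi in zip(y, p))

def aic(ll, n_params):
    """Akaike information criterion, 2k - 2 ln(L); lower is better."""
    return 2 * n_params - 2 * ll

def mcfadden_pseudo_r2(ll_model, ll_null):
    """McFadden pseudo-R2 (assumed definition): 1 - ln(L_model) / ln(L_null)."""
    return 1.0 - ll_model / ll_null

def c_statistic(y, p):
    """Share of positive/negative pairs in which the positive case scores higher (ties count 0.5)."""
    pos = [pi for yi, pi in zip(y, p) if yi == 1]
    neg = [pi for yi, pi in zip(y, p) if yi == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical outcomes and predicted probabilities from two candidate models
y = [1, 1, 1, 0, 0, 0, 0, 1]
p_a = [0.9, 0.8, 0.7, 0.2, 0.3, 0.1, 0.4, 0.6]  # model built on one feature set
p_b = [0.6, 0.5, 0.4, 0.5, 0.6, 0.3, 0.5, 0.5]  # model built on another feature set

# Null model: every case is assigned the observed prevalence
prevalence = sum(y) / len(y)
ll_null = log_likelihood(y, [prevalence] * len(y))

for name, p, n_params in (("A", p_a, 3), ("B", p_b, 3)):
    ll = log_likelihood(y, p)
    print(name,
          "pseudo-R2 =", round(mcfadden_pseudo_r2(ll, ll_null), 3),
          "AIC =", round(aic(ll, n_params), 2),
          "C =", round(c_statistic(y, p), 3))
```

With the same number of parameters, the feature set giving the higher pseudo-R2 and C-statistic and the lower AIC would be preferred, which is the direction of comparison described in the text.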
Furthermore, while traditional regression analyses suffer from multicollinearity when many variables are present, ML techniques alleviate collinearity limitations by leveraging penalization approaches 25,26. Interestingly, the features selected by the ML method showed lower VIFs than those selected by point-biserial r, indicating that SR via GP is effective in avoiding the introduction of collinear variables. SR via GP 27,28 is a versatile, heuristic method that allows detailed analyses even in datasets with a small sample size, partly because it can control model complexity to prevent overfitting of the training data. We confirmed that SR via GP is useful for identifying new risk factors and discriminating unrelated factors. We used SR because it is the ML method we routinely perform. Among the nine ML methods examined, SR showed the highest accuracy (0.767) and the second-highest AUC (0.774) (Fig. 5, Table 5).
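For readers unfamiliar with the VIF, the following minimal sketch shows how it is conventionally computed: each feature is regressed on the remaining features, and VIF = 1 / (1 - R2). The data are synthetic and the function name is illustrative; this is not the authors' code.

```python
import numpy as np

def variance_inflation_factors(X):
    """Return VIF_j = 1 / (1 - R_j^2) for each column j of X,
    where R_j^2 comes from regressing column j on all other columns."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = []
    for j in range(k):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n), others])  # add an intercept column
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        ss_res = float(residuals @ residuals)
        ss_tot = float(((target - target.mean()) ** 2).sum())
        r2 = 1.0 - ss_res / ss_tot
        vifs.append(1.0 / (1.0 - r2))  # grows without bound as R^2 approaches 1
    return vifs

# Synthetic features: a and b are independent, c is nearly a copy of a
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.1 * rng.normal(size=200)

vif_independent = variance_inflation_factors(np.column_stack([a, b]))
vif_collinear = variance_inflation_factors(np.column_stack([a, b, c]))
print("independent features:", [round(v, 2) for v in vif_independent])
print("with a collinear feature:", [round(v, 2) for v in vif_collinear])
```

A feature set that avoids near-duplicate variables keeps every VIF close to 1, whereas including the collinear column inflates the VIFs of both that column and the one it duplicates.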
Table 5. AUC and other statistical metrics in nine machine learning models

| Machine learning model | AUC | Accuracy | Precision | Recall | F-score |
|---|---|---|---|---|---|
| Linear Regression | 0.536 | 0.488 | 0.333 | 0.467 | 0.389 |
| Lasso Regression | 0.736 | 0.698 | 0.563 | 0.600 | 0.581 |
| Ridge Regression | 0.695 | 0.651 | 0.500 | 0.533 | 0.714 |
| Logistic Regression | 0.662 | 0.721 | 0.636 | 0.467 | 0.857 |
| Naïve Bayes | 0.548 | 0.512 | 0.385 | 0.667 | 0.488 |
| SVMs | 0.615 | 0.721 | 0.800 | 0.267 | 0.964 |
| Random Forest | 0.631 | 0.721 | 0.714 | 0.333 | 0.929 |
| XGBoost | 0.826 | 0.698 | 0.563 | 0.600 | 0.750 |
| Symbolic regression via GP | 0.774 | 0.767 | 0.727 | 0.533 | 0.615 |

Abbreviations: AUC, area under the curve; GP, genetic programming.
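As a reading aid for Table 5, the sketch below shows how an F-score relates to precision and recall, assuming the standard F1 definition (their harmonic mean). The source does not define its F-score column, so rows whose reported value differs from this formula may rest on a different definition.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (standard F1; assumed definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall copied from two rows of Table 5
rows = {
    "Lasso Regression": (0.563, 0.600),
    "Symbolic regression via GP": (0.727, 0.533),
}
for name, (precision, recall) in rows.items():
    print(f"{name}: F1 = {f1_score(precision, recall):.3f}")
```

For these two rows the computed values, 0.581 and 0.615, agree with the F-score column.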
Because GH is multifactorial and occurs in different pathophysiological conditions 7, it is technically difficult to distinguish truly valuable risk factors from the many candidate risk factors using conventional statistical methods. However, the strong feature-selection ability of ML makes such discrimination possible by ranking predictive features in patients with multifactorial chronic diseases. Here, the ML score (GP) for MaxGD ≥ 242.3 µm identified the top 8 features: BMI, complement C3, serum total protein, arteriolosclerosis, urinary protein excretion (U-Prot) during the 10-year follow-up, edema, C-reactive protein, and the Oxford E1 score. Furthermore, eGFR was ranked 46th in the ML scores (GP), indicating that MaxGD is more relevant to current injuries, such as vascular damage, inflammation, and obesity, than to past injuries represented by nephron disappearance.
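The conventional baseline against which the ML ranking is compared, point-biserial r, can be sketched as follows. The outcome and feature values are hypothetical (the names feature_a and feature_b are placeholders, not study variables), and ranking by absolute r is an assumption about how such a baseline would typically be applied.

```python
import math

def point_biserial_r(binary, values):
    """Point-biserial r between a 0/1 outcome and a continuous feature.
    Equals Pearson's r when the outcome is coded 0/1."""
    n = len(values)
    group1 = [v for b, v in zip(binary, values) if b == 1]
    group0 = [v for b, v in zip(binary, values) if b == 0]
    n1, n0 = len(group1), len(group0)
    mean_all = sum(values) / n
    sd_all = math.sqrt(sum((v - mean_all) ** 2 for v in values) / n)  # population SD
    mean1 = sum(group1) / n1
    mean0 = sum(group0) / n0
    return (mean1 - mean0) / sd_all * math.sqrt(n1 * n0 / n ** 2)

# Hypothetical binary outcome (e.g., size at or above a threshold) and two features
outcome = [1, 1, 1, 1, 0, 0, 0, 0]
features = {
    "feature_a": [30, 28, 27, 31, 22, 24, 23, 21],  # differs clearly between groups
    "feature_b": [5, 3, 4, 6, 4, 5, 3, 5],          # barely differs between groups
}

# Rank features by the absolute strength of their association with the outcome
ranking = sorted(
    ((name, point_biserial_r(outcome, vals)) for name, vals in features.items()),
    key=lambda item: abs(item[1]),
    reverse=True,
)
for name, r in ranking:
    print(f"{name}: r = {r:.3f}")
```

Because this baseline scores each feature separately and only captures linear association, it cannot penalize redundant features, which is one reason a multivariable ML ranking can differ from it.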
Aging is associated with nephron loss/low eGFR; however, whether these findings are pathological remains controversial 29–32. Therefore, it is desirable to distinguish compensatory hypertrophy due to nephron loss/low GFR from morbid hypertrophy due to disease activity 7. The key to this question is the rightward shift 33 in the glomerular size distribution caused by nephron loss/low GFR and the threshold for morbid GH 7,33. The five-sixths nephrectomized (or subtotal nephrectomized) model is a frequently used animal model of progressive kidney failure caused by nephron loss/low GFR. The glomerular diameter in subtotal nephrectomized rats increased to approximately 1.5 times the glomerular diameter of the control group, while that of the non-nephrectomized rats increased to approximately 1.1 times the glomerular diameter of the control group 34. Therefore, GH > 1.5 times its original diameter (2.25 times area and 3.38 times volume, assuming glomeruli are spherical) may be morbid 7. In a recent study examining individual glomerular size using magnetic resonance imaging-based glomerular morphology in a mouse model 35, the increase in glomerular volume due to aging showed a rightward shift in the glomerular size distribution in the range of < 3 times the volume. Similar results have been observed in humans 30. Furthermore, while American women with low nephron numbers did not demonstrate GH, American men with low nephron numbers showed marked GH 36. Thus, compared with nephron loss, glomerular size could be a more direct indicator of disease severity.
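The area and volume figures quoted above follow from simple geometric scaling. Assuming spherical glomeruli, as the text does, area scales with the square and volume with the cube of the diameter ratio; the function name below is illustrative.

```python
def size_ratios(diameter_ratio):
    """Area and volume ratios implied by a diameter ratio, assuming spherical glomeruli."""
    return diameter_ratio ** 2, diameter_ratio ** 3

# Diameter ratios cited in the text: about 1.1 and about 1.5 times the control
for ratio in (1.1, 1.5):
    area, volume = size_ratios(ratio)
    print(f"diameter x{ratio}: area x{area:.2f}, volume x{volume:.2f}")
```

For a diameter ratio of 1.5, this gives an area ratio of 2.25 and a volume ratio of 3.375, which rounds to the 3.38 quoted in the text and lies above the < 3-fold volume range attributed to aging.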
To the best of our knowledge, our study is the first to identify predictive features for MaxGD using ML. The methodological novelty of our research lies in combining the exploratory strengths of ML with the validation strengths of conventional statistical methods, and our findings will contribute significantly to future clinical research. Furthermore, our findings on the discrimination between compensatory GH caused by nephron loss and pathological GH have significant clinical importance in nephrology.
One limitation is that this study was observational; thus, the observed associations do not prove causality. Although countermeasures for small sample sizes 13,14 were adopted, such as permutation tests and LOO-CV using SR via GP, the small sample size should be noted. Furthermore, because comparing the superiority of ML methods was not the main focus, we did not adopt the approach typical of ML studies, which compare the hit rates of predictive models created by different ML methods. Lastly, although one of the strengths of ML is that it allows exploration of non-linear relationships, complete verification may not be possible with conventional statistical methods when non-linear factors are selected by ML. However, the study results, such as the increases in pseudo-R2 and R2 in the model based on factors selected by SR, indicate that the factor selection ability of ML is also excellent for linear factors.
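The two small-sample countermeasures mentioned above, LOO-CV and permutation testing, can be sketched together. This is an illustration only: a nearest-class-mean classifier on a single feature is substituted for SR via GP, and the data are hypothetical.

```python
import random

def loo_accuracy(x, y):
    """Leave-one-out accuracy of a nearest-class-mean classifier on one feature."""
    correct = 0
    for i in range(len(x)):
        # Train on every sample except the i-th
        train = [(xj, yj) for j, (xj, yj) in enumerate(zip(x, y)) if j != i]
        ones = [xj for xj, yj in train if yj == 1]
        zeros = [xj for xj, yj in train if yj == 0]
        if not ones or not zeros:
            continue  # one class is missing from the training fold; counted as a miss
        mean1 = sum(ones) / len(ones)
        mean0 = sum(zeros) / len(zeros)
        predicted = 1 if abs(x[i] - mean1) < abs(x[i] - mean0) else 0
        if predicted == y[i]:
            correct += 1
    return correct / len(x)

def permutation_p_value(x, y, n_permutations=2000, seed=0):
    """Share of label shufflings whose LOO accuracy reaches the observed accuracy."""
    rng = random.Random(seed)
    observed = loo_accuracy(x, y)
    shuffled = list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break any real link between feature and outcome
        if loo_accuracy(x, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_permutations + 1)  # add-one correction keeps p above 0

# Hypothetical feature values and binary outcomes for ten patients
x = [30, 28, 27, 31, 29, 22, 24, 23, 21, 25]
y = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

print("LOO accuracy:", loo_accuracy(x, y))
print("permutation p-value:", round(permutation_p_value(x, y), 4))
```

LOO-CV uses every sample for testing exactly once, which suits small datasets, and the permutation test asks how often a performance at least as good would arise if the labels carried no information.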
In conclusion, ML demonstrated a weak association between MaxGD and eGFR, indicating that MaxGD represents morbid GH rather than compensatory GH. Furthermore, in a comparative validation using a conventional statistical technique, feature selection by ML avoided collinearity and increased pseudo-R2 and R2 values to a greater degree than feature selection using point-biserial r. Moreover, ML may be useful for identifying unknown risk factors and unrelated factors. Our method may be generalized to other types of medical research because of the procedural simplicity of using top-ranked features selected by ML.