Background
Clustering of observations is a common phenomenon in epidemiological and clinical research. Previous studies have highlighted the importance of using multilevel analysis to account for such clustering, but in practice, methods that ignore clustering are often used. We used simulated data to explore the circumstances in which failure to account for clustering in linear regression analysis could lead to materially erroneous conclusions.
Methods
We simulated data following the random-intercept model specification under different scenarios of clustering of a continuous outcome and a single continuous or binary explanatory variable. We fitted random-intercept (RI) and cluster-unadjusted ordinary least squares (OLS) models and compared the derived estimates of effect, as quantified by regression coefficients, and their estimated precision. We also assessed the extent to which coverage by 95% confidence intervals and rates of Type I error were appropriate.
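The kind of simulation described above can be sketched as follows. This is an illustrative sketch only, not the code used in the study: the function names, the parameter values (20 clusters of 20 observations, both ICCs set to 0.3, a true slope of 0.5), the normalisation of total variances to 1, and the use of a normal critical value of 1.96 are all assumptions made for the example. It also fits only the cluster-unadjusted OLS model; fitting the RI model would require a mixed-model routine that is not shown here.

```python
import random
import statistics


def simulate_clustered(n_clusters, cluster_size, icc_x, icc_y, beta, rng):
    """Draw (x, y) pairs from a random-intercept data-generating model.

    Assumption: the total variance of x, and of the part of y not
    explained by x, is fixed at 1, so each ICC is the between-cluster
    share of that variance.
    """
    xs, ys = [], []
    for _ in range(n_clusters):
        u_x = rng.gauss(0.0, icc_x ** 0.5)  # cluster-level component of x
        u_y = rng.gauss(0.0, icc_y ** 0.5)  # random intercept for y
        for _ in range(cluster_size):
            x = u_x + rng.gauss(0.0, (1.0 - icc_x) ** 0.5)
            y = beta * x + u_y + rng.gauss(0.0, (1.0 - icc_y) ** 0.5)
            xs.append(x)
            ys.append(y)
    return xs, ys


def ols_slope_and_se(xs, ys):
    """Cluster-unadjusted OLS slope and its conventional standard error."""
    n = len(xs)
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    se = (rss / (n - 2) / sxx) ** 0.5
    return slope, se


def run_simulation(n_reps=300, n_clusters=20, cluster_size=20,
                   icc_x=0.3, icc_y=0.3, beta=0.5, seed=1):
    """Repeat the simulation and summarise OLS performance."""
    rng = random.Random(seed)
    slopes, ses, covered = [], [], 0
    for _ in range(n_reps):
        xs, ys = simulate_clustered(n_clusters, cluster_size,
                                    icc_x, icc_y, beta, rng)
        slope, se = ols_slope_and_se(xs, ys)
        slopes.append(slope)
        ses.append(se)
        # Approximate 95% CI using the normal critical value 1.96.
        if slope - 1.96 * se <= beta <= slope + 1.96 * se:
            covered += 1
    return {
        "mean_slope": statistics.fmean(slopes),      # compare with beta
        "empirical_sd": statistics.stdev(slopes),    # actual sampling spread
        "mean_model_se": statistics.fmean(ses),      # spread claimed by OLS
        "coverage": covered / n_reps,                # nominal value is 0.95
    }


if __name__ == "__main__":
    for name, value in run_simulation().items():
        print(f"{name}: {value:.3f}")
```

Comparing `mean_model_se` with `empirical_sd` indicates whether OLS overstates precision, and `coverage` can be compared with the nominal 0.95. Setting `beta=0` and counting intervals that exclude zero would give the corresponding Type I error rate.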
Results
We found that effects estimated from OLS linear regression models that ignored clustering were on average unbiased. The precision of effect estimates from the OLS model was overestimated when both the outcome and explanatory variable were continuous. By contrast, in linear regression with a binary explanatory variable, the precision of effect estimates was, in most circumstances, somewhat underestimated by the OLS model. The magnitude of bias, both in point estimates and their precision, increased with greater clustering of the outcome variable, and was also influenced by the amount of clustering in the explanatory variable. The cluster-unadjusted model resulted in poor coverage by 95% confidence intervals and high rates of Type I error, especially when the explanatory variable was continuous.
Conclusions
In this study, we identified situations in which use of a cluster-unadjusted OLS regression model is more likely to distort statistical inference, namely when the explanatory variable is continuous and its intraclass correlation coefficient is higher than 0.01. We also identified situations in which statistical inference is less likely to be affected.