Consequences of Ignoring Clustering in Linear Regression

doi:10.21203/rs.3.rs-98069/v1

Research article

https://doi.org/10.21203/rs.3.rs-98069/v1

This work is licensed under a CC BY 4.0 License

Journal Publication

published 07 Jul, 2021

Read the published version in BMC Medical Research Methodology →

You are reading this latest preprint version

Background

Clustering of observations is a common phenomenon in epidemiological and clinical research. Previous studies have highlighted the importance of using multilevel analysis to account for such clustering, but in practice, methods that ignore clustering are often used. We used simulated data to explore the circumstances in which failure to account for clustering in linear regression analysis could lead to materially erroneous conclusions.

Methods

We simulated data following the random-intercept model specification under different scenarios of clustering of a continuous outcome and a single continuous or binary explanatory variable. We fitted random-intercept (RI) and cluster-unadjusted ordinary least squares (OLS) models and compared the derived estimates of effect, as quantified by regression coefficients, and their estimated precision. We also assessed the extent to which coverage by 95% confidence intervals and rates of Type I error were appropriate.
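The kind of comparison described above can be illustrated with a minimal sketch. It is not the authors' simulation code: the function names, the number of clusters, the cluster size, the true coefficient, and the ICC values are all assumptions chosen for illustration, and the outcome ICC here refers to the residual variation of the outcome given the explanatory variable. The sketch generates data from a random-intercept model, fits only the cluster-unadjusted OLS model, and compares the average model-based standard error with the empirical spread of the estimates across replications. A random-intercept fit is deliberately left out to keep the example dependency-free.

```python
import random
import statistics


def simulate_clustered(rng, n_clusters, cluster_size, beta, icc_x, icc_y):
    """Draw one data set from a random-intercept model (illustrative setup).

    x and the residual part of y each have total variance 1, split into a
    cluster-level share (the ICC) and an individual-level share (1 - ICC).
    """
    xs, ys = [], []
    for _ in range(n_clusters):
        x_cluster = rng.gauss(0.0, icc_x ** 0.5)   # cluster-level part of x
        u = rng.gauss(0.0, icc_y ** 0.5)           # random intercept
        for _ in range(cluster_size):
            x = x_cluster + rng.gauss(0.0, (1.0 - icc_x) ** 0.5)
            y = beta * x + u + rng.gauss(0.0, (1.0 - icc_y) ** 0.5)
            xs.append(x)
            ys.append(y)
    return xs, ys


def ols_fit(xs, ys):
    """Simple linear regression ignoring clusters: slope and its naive SE."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    naive_se = (rss / (n - 2) / sxx) ** 0.5
    return slope, naive_se


def run_simulation(n_reps=300, n_clusters=30, cluster_size=20,
                   beta=0.5, icc_x=0.5, icc_y=0.5, seed=1):
    """Repeat the simulation and summarise bias, precision, and coverage."""
    rng = random.Random(seed)
    slopes, naive_ses, covered = [], [], 0
    for _ in range(n_reps):
        xs, ys = simulate_clustered(rng, n_clusters, cluster_size,
                                    beta, icc_x, icc_y)
        slope, se = ols_fit(xs, ys)
        slopes.append(slope)
        naive_ses.append(se)
        # Normal-approximation 95% interval; adequate for n = 600 here.
        if abs(slope - beta) <= 1.96 * se:
            covered += 1
    return {
        "mean_slope": statistics.mean(slopes),
        "empirical_sd": statistics.stdev(slopes),
        "mean_naive_se": statistics.mean(naive_ses),
        "coverage": covered / n_reps,
    }


if __name__ == "__main__":
    print(run_simulation())
```

Under these assumed settings, the average slope should sit close to the true value while the naive standard error falls well short of the empirical standard deviation, so the nominal 95% intervals cover the true coefficient too rarely. Setting both ICC arguments to zero should bring coverage back near the nominal level.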

Results

We found that effects estimated from OLS linear regression models that ignored clustering were on average unbiased. The precision of effect estimates from the OLS model was overestimated when both the outcome and explanatory variable were continuous. By contrast, in linear regression with a binary explanatory variable, the precision of effects was, in most circumstances, somewhat underestimated by the OLS model. The magnitude of bias, both in point estimates and their precision, increased with greater clustering of the outcome variable, and was also influenced by the amount of clustering in the explanatory variable. The cluster-unadjusted model resulted in poor coverage rates by 95% confidence intervals and high rates of Type I error, especially when the explanatory variable was continuous.

Conclusions

In this study we identified situations in which an OLS regression model is more likely to affect statistical inference, namely when the explanatory variable is continuous, and its intraclass correlation coefficient is higher than 0.01. Situations in which statistical inference is less likely to be affected have also been identified.
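To check a data set against the intraclass correlation threshold mentioned above, the ICC of a variable can be estimated from its between- and within-cluster mean squares. The sketch below uses the standard one-way ANOVA estimator for balanced clusters; the function name `anova_icc` and the list-of-lists input layout are assumptions for illustration, and the abstract does not say which ICC estimator the authors used.

```python
def anova_icc(clusters):
    """One-way ANOVA estimate of the ICC for equally sized clusters.

    `clusters` is a list of lists, one inner list of values per cluster.
    ICC = (MSB - MSW) / (MSB + (m - 1) * MSW), with m the cluster size.
    The estimate can be negative when within-cluster variation dominates.
    """
    k = len(clusters)
    m = len(clusters[0])
    if k < 2 or m < 2 or any(len(c) != m for c in clusters):
        raise ValueError("need at least 2 clusters of equal size >= 2")

    grand_mean = sum(sum(c) for c in clusters) / (k * m)
    cluster_means = [sum(c) / m for c in clusters]

    # Between-cluster mean square.
    msb = m * sum((mu - grand_mean) ** 2 for mu in cluster_means) / (k - 1)
    # Within-cluster mean square.
    msw = sum((v - mu) ** 2
              for c, mu in zip(clusters, cluster_means)
              for v in c) / (k * (m - 1))
    return (msb - msw) / (msb + (m - 1) * msw)
```

For example, `anova_icc([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])` gives 1.0, because all variation lies between clusters.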

Keywords: Epidemiology, Clustering, linear regression, random intercept model, consequences, simulation, comparison, bias

Editorial decision: Major revision (13 Jan, 2021)
Review #2 received at journal (07 Jan, 2021)
Reviewer #2 agreed at journal (20 Dec, 2020)
Review #1 received at journal (10 Dec, 2020)
Reviewer #1 agreed at journal (29 Nov, 2020)
Reviewers invited by journal (29 Oct, 2020)
Editor assigned by journal (21 Oct, 2020)
First submitted to journal (20 Oct, 2020)
Submission checks completed at journal (20 Oct, 2020)
Editor invited by journal (20 Oct, 2020)
