Prevalence measures the burden of disease in a population at a given location and time, expressed as the proportion of people affected by the disease (1). Estimates of HIV prevalence are frequently used to monitor the HIV epidemic and study its determinants, to identify groups at high risk of HIV infection, and to assess the need for HIV prevention and treatment (2).
Population surveys and demographic studies have become the gold standard for estimating national HIV prevalence (3). However, non-response in these surveys is a major concern (4). Individuals may not participate because interviewers could not contact them or because they refused to consent to an HIV test (4). Non-response can bias population-based estimates of HIV prevalence if it is associated with HIV status. This could occur for two reasons: individuals may refuse HIV testing because they already know their status, or because they engage in high-risk sexual behavior (5).
Missing data in research can be classified into three types. First, data are missing completely at random (MCAR) when missingness is independent of the outcome and of any other observed or unobserved characteristics. Second, data are missing at random (MAR) when missingness may depend on observed covariates but, given those covariates, is independent of the unobserved data. Third, data are missing not at random (MNAR) when they are neither MCAR nor MAR, that is, when missingness depends on the unobserved data even after accounting for the observed data (6).
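The three mechanisms can be written compactly in probability notation. The symbols below are assumptions introduced here for illustration and do not appear in the original text: Y is the HIV status, X is the set of fully observed covariates, and R is a response indicator equal to 1 when Y is observed. This is a simplified statement for a single partially observed outcome, not a general definition.

```latex
% Assumed notation: Y = HIV status, X = observed covariates,
% R = 1 if Y is observed, 0 if Y is missing.
\begin{align*}
\text{MCAR:} \quad & P(R = 1 \mid Y, X) = P(R = 1) \\
\text{MAR:}  \quad & P(R = 1 \mid Y, X) = P(R = 1 \mid X) \\
\text{MNAR:} \quad & P(R = 1 \mid Y, X) \neq P(R = 1 \mid X)
  \quad \text{for some values of } Y \text{ and } X
\end{align*}
```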
In population-based HIV studies, data can be assumed to be MCAR if a participant gave a blood sample but the sample was destroyed before it was tested, so that the missingness is not associated with HIV status or any other observed covariate (7). If, however, a participant misses a test because they had a long way to walk, the data would be MAR: although missingness is not directly related to HIV status, it may be related to residence or other observed covariates, which may, in turn, be associated with HIV status (8). Finally, data are MNAR when an eligible participant does not attend or consent to testing because they already know their HIV status, have a high probability of being HIV positive, or belong to a high-risk group. Here, the missingness depends on the missing HIV status itself, so the MAR assumption is violated. Data arising from such a mechanism are considered missing not at random (MNAR), or non-ignorable (9).
When observations are missing completely at random, the missing observations are a random subset of all observations; the missing and observed values will have similar distributions, and analyses of the observed values will produce unbiased estimates. If observations are MAR, there may be systematic differences between the missing and observed values, but these can be entirely explained by other observed variables. For example, if HIV status is missing at random conditional on age, sex, residence and marital status, then the distributions of the missing and observed HIV status will be similar among people of the same age, sex, residence and marital status (10). However, if observations are MNAR, the distributions will differ even after conditioning on the observed covariates, and any estimates may be biased (11).
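A small numerical sketch may help make the MAR case concrete. The counts below are hypothetical and invented purely for illustration; they are not data from any survey. Response depends only on an observed covariate (residence), so the observed prevalence within each stratum matches the true prevalence, while the pooled complete-case estimate does not.

```python
# Hypothetical counts, invented for illustration only (not survey data).
# Within each residence stratum, response is unrelated to HIV status (MAR),
# but the stratum with the higher prevalence has the lower response rate.
strata = {
    "urban": {"eligible": 1000, "true_positive": 100,
              "tested": 500, "tested_positive": 50},
    "rural": {"eligible": 1000, "true_positive": 20,
              "tested": 900, "tested_positive": 18},
}


def true_prevalence(strata):
    """Prevalence among all eligible people (unknowable in a real survey)."""
    positive = sum(s["true_positive"] for s in strata.values())
    eligible = sum(s["eligible"] for s in strata.values())
    return positive / eligible


def complete_case_prevalence(strata):
    """Prevalence among tested respondents only, pooled over strata."""
    positive = sum(s["tested_positive"] for s in strata.values())
    tested = sum(s["tested"] for s in strata.values())
    return positive / tested


def stratum_prevalence(stratum):
    """Return (observed, true) prevalence within a single stratum."""
    observed = stratum["tested_positive"] / stratum["tested"]
    true = stratum["true_positive"] / stratum["eligible"]
    return observed, true


for name, stratum in strata.items():
    observed, true = stratum_prevalence(stratum)
    print(f"{name}: observed {observed:.3f}, true {true:.3f}")

print(f"pooled complete-case: {complete_case_prevalence(strata):.4f}")
print(f"pooled true:          {true_prevalence(strata):.4f}")
```

In this constructed example the complete-case estimate is too low because the stratum with more HIV-positive people is under-represented among those tested, even though nothing is wrong within either stratum.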
Most researchers use conventional methods such as complete case or available case analysis, which assume that data are MCAR. Using these methods when the missing data are not MCAR results in loss of information and biased estimates of HIV prevalence (12). Statistical methods have been developed to adjust for missing data when the missingness is not completely at random. Methods such as inverse probability weighting (IPW), maximum likelihood estimation, multiple imputation and doubly robust methods can produce less biased estimates.
IPW methods rely on the intuitive idea of creating a pseudo-population of weighted copies of the complete cases to remove the selection bias introduced by the missing data. However, different weighting approaches are required depending on the missing data pattern and mechanism (13). Maximum likelihood estimation and multiple imputation (MI) are other methods used to adjust for missing data (14). In MI, missing data are replaced by values drawn from an imputation model. This is done M times, generating M complete datasets. Each generated dataset is analyzed and an estimate of the model parameters is calculated (15). The overall estimate is simply the average of the M estimates, and the standard errors of the estimates are obtained using Rubin’s rules (8).
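The two ideas above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the analysis used in this study: the function names `ipw_prevalence` and `pool_rubin` are hypothetical, the response probabilities are taken as known rather than estimated from a model, and all numbers are invented for illustration.

```python
from math import sqrt
from statistics import mean, variance


def ipw_prevalence(records):
    """Weighted prevalence among tested respondents.

    records: list of (hiv_positive, response_probability) pairs, one per
    tested respondent. Each respondent is weighted by the inverse of their
    probability of responding, so they also stand in for similar people
    who were not tested.
    """
    weights = [1.0 / prob for _, prob in records]
    weighted_positive = sum(w for (positive, _), w in zip(records, weights)
                            if positive)
    return weighted_positive / sum(weights)


def pool_rubin(estimates, variances):
    """Combine M imputed-data estimates using Rubin's rules.

    Returns the pooled estimate and its standard error, where the total
    variance is W + (1 + 1/M) * B, with W the mean within-imputation
    variance and B the between-imputation variance.
    """
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(variances)
    between = variance(estimates)  # sample variance, divisor M - 1
    total = within + (1 + 1 / m) * between
    return pooled, sqrt(total)


# Hypothetical respondents: two groups with response probabilities 0.5 and 0.9.
records = ([(True, 0.5)] * 50 + [(False, 0.5)] * 450
           + [(True, 0.9)] * 18 + [(False, 0.9)] * 882)
print(f"unweighted prevalence: {sum(p for p, _ in records) / len(records):.4f}")
print(f"IPW prevalence:        {ipw_prevalence(records):.4f}")

# Hypothetical prevalence estimates and variances from M = 3 imputed datasets.
pooled, se = pool_rubin([0.058, 0.062, 0.060], [1e-4, 1e-4, 1e-4])
print(f"pooled estimate {pooled:.4f}, standard error {se:.4f}")
```

In practice the response probabilities would usually be estimated, for example from a regression of response on observed covariates, and the imputed datasets would come from a fitted imputation model; both steps are omitted here to keep the sketch short.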
However, these statistical methods are rarely applied in surveys of HIV prevalence because of their complexity, the extra time needed for the analysis and the limited availability of software. Depending on the pattern and mechanism of the missingness, some techniques are superior to others.
The objective of this study was to determine the effect of missing data on the estimates of HIV prevalence from a population survey in Tanzania, using complete case analysis, multiple imputation (MI) and inverse probability weighting (IPW).