Using Multiple Imputation and Inverse Probability Weighting to Adjust for Missing Data in HIV Prevalence Estimates: A Cross-Sectional Study in Mwanza, North Western Tanzania.

Serological samples were collected from participants who were resident in a Demographic Surveillance System (DSS) in Kisesa, Tanzania. HIV prevalence was estimated using three methods. The first was complete case analysis (CCA), which assumed that data were missing completely at random (MCAR). The other two methods, multiple imputation (MI) and inverse probability weighting (IPW), assumed that data were missing at random (MAR). For MI, a logistic regression model adjusting for age, sex, residence, and marital status was used to impute 20 datasets to re-estimate the HIV prevalence. For IPW, propensity scores for participating in the sero-survey and being tested for HIV, given age, sex, and marital status, were generated, and inverse probability weights were derived from them for those who were tested.


Background
Prevalence measures the burden of disease in a population in a given location and at a particular time, representing the proportion of people affected by the disease (1). Estimates of HIV prevalence are frequently used to monitor and study the determinants of the HIV epidemic, to identify groups at high risk of HIV infection, and to assess the need for HIV prevention and treatment (2).
Population surveys and demographic studies have become the gold standard for estimating national HIV prevalence (3). However, non-response in these surveys is of major concern (4). Individuals may not participate because the interviewers could not contact them for interview or because they refused to consent to an HIV test (4). Non-response can bias population-based estimates of HIV prevalence if it is associated with HIV status in any way. This could occur for two reasons: an individual may refuse HIV testing because he or she already knows his or her status, or because he or she is involved in high-risk sexual behavior (5).
Missing data in research can be classified into three types. First, data are missing completely at random (MCAR) when missingness is independent of the outcome and of any other observed or unobserved characteristics. Second, data are missing at random (MAR) when missingness can depend on observed covariates but, given those covariates, is independent of the unobserved data. Third, data are missing not at random (MNAR) when they are neither MCAR nor MAR, that is, when missingness depends on the unobserved data even after accounting for the observed data (6).
In population-based HIV studies, data can be assumed to be MCAR if a participant gave a blood sample but the sample was destroyed before it was tested, so that the missingness is not associated with HIV status or any other observed covariate (7). If, however, a participant misses a test because he or she had a long way to walk, then the data would be MAR: although missingness is not directly related to HIV status, it may be related to residence or other observed covariates, which may, in turn, be associated with HIV status (8). Finally, data are MNAR when an eligible study participant does not come or consent for testing because they already know their HIV status, have a high probability of being HIV positive, or belong to a high-risk group. Here, the missingness depends on the missing HIV status itself, so the MAR assumption is violated. Data arising from such a mechanism are considered missing not at random (MNAR), or non-ignorable (9).
When observations are missing completely at random, the missing observations are a random subset of all observations; the missing and observed values will have similar distributions, and estimates based on the observed values will be unbiased. However, if observations are MAR there might be systematic differences between the missing and observed values, but these can be entirely explained by other observed variables. For example, if HIV status is missing at random, conditional on age, sex, residence and marital status, then the distributions of the missing and observed HIV status will be similar among people of the same age, sex, residence and marital status (10). If observations are MNAR, however, the distributions will differ even after conditioning on the observed covariates, and any estimates may be biased (11).
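The contrast between MCAR and MAR can be illustrated with a small worked calculation. The sketch below assumes a hypothetical population with two residence strata; every count, prevalence and response rate in it is invented for illustration and none is taken from the Kisesa data. It computes the prevalence that a complete case analysis would be expected to report when response is the same in both strata (MCAR) and when response depends on the stratum (MAR).

```python
# Hypothetical population with two strata; all numbers are invented for illustration.
strata = {
    "rural": {"n": 6000, "prevalence": 0.04},
    "urban": {"n": 4000, "prevalence": 0.10},
}


def true_prevalence(strata):
    """Prevalence in the full population (what the survey is trying to estimate)."""
    total = sum(s["n"] for s in strata.values())
    positives = sum(s["n"] * s["prevalence"] for s in strata.values())
    return positives / total


def complete_case_prevalence(strata, response):
    """Expected prevalence among responders when response depends only on stratum."""
    tested = sum(s["n"] * response[k] for k, s in strata.items())
    positives = sum(s["n"] * response[k] * s["prevalence"] for k, s in strata.items())
    return positives / tested


mcar = {"rural": 0.3, "urban": 0.3}  # same response rate in every stratum
mar = {"rural": 0.5, "urban": 0.1}   # response depends on the observed stratum

print(round(true_prevalence(strata), 4))                 # 0.064
print(round(complete_case_prevalence(strata, mcar), 4))  # 0.064
print(round(complete_case_prevalence(strata, mar), 4))   # 0.0471
```

Under the hypothetical MAR pattern the higher-prevalence stratum is under-represented among responders, so the complete case estimate falls below the population value even though, within each stratum, responders and non-responders have the same prevalence.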
Most researchers use conventional methods such as complete case or available case analysis, which assume that data are MCAR. The use of these methods in the presence of missing data that are not MCAR results in loss of information and biased estimates of HIV prevalence (12). Statistical methods have been developed that can adjust for missing data when the missingness is not completely at random. Methods such as inverse probability weighting (IPW), maximum likelihood estimation, multiple imputation and doubly robust methods can produce less biased estimates.
The IPW methods rely on the intuitive idea of creating a pseudo-population of weighted copies of the complete cases to remove the selection bias introduced by the missing data. However, different weighting approaches are required depending on the missing data pattern and mechanism (13). Maximum likelihood estimation and multiple imputation (MI) are the other methods used to adjust for missing data (14). In MI, missing data are replaced by values drawn from an imputation model. This is done M times, generating M complete datasets. Each generated dataset is analyzed and an estimate of the model parameters is calculated (15). The overall estimate is simply the average of the M estimates, and the standard errors of the estimates are obtained using Rubin's rules (8).
However, in surveys of HIV prevalence, the application of these statistical methods is rare because of their complexity, the extra time needed for the analysis and the limited availability of software. Depending on the pattern and mechanism of the missingness, some techniques are superior to others.
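The pooling step described above can be sketched in a few lines. The function name `pool_rubin` and the five estimates and standard errors are hypothetical and chosen only for illustration. The interval at the end uses a normal approximation as a simplification; Rubin's rules proper use a t reference distribution with adjusted degrees of freedom.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool M point estimates and their squared standard errors with Rubin's rules."""
    m = len(estimates)
    q_bar = mean(estimates)          # pooled point estimate: average of the M estimates
    within = mean(variances)         # within-imputation variance
    between = variance(estimates)    # between-imputation variance (divisor M - 1)
    total = within + (1 + 1 / m) * between
    return q_bar, math.sqrt(total)


# Hypothetical prevalence estimates and standard errors from M = 5 imputed datasets.
est = [0.066, 0.068, 0.070, 0.067, 0.069]
se = [0.004, 0.004, 0.004, 0.004, 0.004]

pooled, pooled_se = pool_rubin(est, [s ** 2 for s in se])

# Normal-approximation 95% interval (a simplification, see the note above).
lower, upper = pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se
print(round(pooled, 4), round(pooled_se, 5))
```

The pooled standard error is larger than the standard error from any single imputed dataset because the between-imputation term carries the extra uncertainty due to the missing values.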
The objective of this study was to determine the effect of missing data on the estimates of HIV prevalence from a population survey in Tanzania, using complete case analysis, multiple imputation (MI) and inverse probability weighting (IPW).

Data Source
Data were obtained from the Kisesa observation HIV cohort study in Magu District, Mwanza Region, Northwestern Tanzania. This cohort is located within a Health and Demographic Surveillance System (HDSS), which had its baseline census in 1994, followed by regular household visits to record all births, deaths and migrations. Currently there are 34 completed rounds of HDSS (16). HIV and other infectious diseases are monitored in the cohort using a series of epidemiological serological surveys that measured the HIV status of residents at three-year intervals from 1994 to 2016; currently there are 8 completed serological surveys.
This study used data from HDSS round 30 (2015) and sero-survey round 8 (sero8), implemented during 2015/2016. All residents aged 15 years and above from Kisesa HDSS round 30 were eligible to take part in sero8. Participants were invited through invitation slips, which informed them about the location of the temporary clinic and their date of participation. At the clinic, all participants were asked for their written consent to participate in the survey and to be tested for HIV. Consent for minors (under the age of 18 years) was obtained at home from parents or guardians, and assent was provided by the minor at the clinic. During the sero8 operations, participants were interviewed using a structured questionnaire to report on their socio-demographic characteristics. Blood samples were collected through finger prick and tested for HIV antibodies using the Alere Determine™ HIV-1/2 rapid test for screening and the Trinity Biotech Uni-Gold™ HIV rapid test for confirmation.

Statistical methods
The outcome of interest was HIV status (positive/negative), with HIV prevalence estimated using three methods: complete case analysis on the sero8 survey data alone, which assumed that HIV status missing through non-attendance at the survey was missing completely at random (MCAR); and multiple imputation (MI) and inverse probability weighting (IPW), which assumed data to be missing at random (MAR), with attendance at the survey dependent on age, sex, residence and marital status.
In the complete case analysis, all participants with missing HIV status or missing any of the covariates were excluded from the analysis. Participants who had missing HIV status were treated as a random subset of the complete sample of subjects, and the set of participants with no missing HIV status was likewise treated as a random sample from the source population (7). This approach can only result in unbiased estimates when it is demonstrable that missing data are not associated with HIV status in any way (17).
Multiple imputation (MI) involved imputing values for the missing HIV status of those who did not attend the sero8 survey, based on age, sex, residence and marital status (12). We imputed 20 datasets (M=20) using the Markov Chain Monte Carlo (MCMC) algorithm with a binomial distribution, replacing each missing HIV value with values consistent with that person's age, sex, residence and marital status.
After imputation, each dataset was used to estimate the HIV prevalence using logistic regression. The 20 estimates of HIV prevalence were averaged to obtain a pooled estimate. Rubin's rules were used to combine the standard errors and obtain the 95% confidence interval for the pooled estimate (18).
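The impute, analyze and pool cycle can be sketched as follows. This is not the study's implementation: the study imputed from a logistic regression model using MCMC, whereas this sketch swaps in a simpler stratified Beta-Binomial imputation model with a single covariate and a uniform prior, which is an assumption made to keep the example self-contained. The strata and counts are hypothetical.

```python
import math
import random
from statistics import mean, variance

random.seed(2016)  # fixed seed so the sketch is reproducible

# Hypothetical data per stratum: (tested, HIV positive among tested, not tested).
strata = {
    "rural": (3000, 120, 3000),
    "urban": (400, 40, 3600),
}
M = 20  # number of imputed datasets
n_total = sum(tested + untested for tested, _, untested in strata.values())

estimates, variances = [], []
for _ in range(M):
    positives = 0
    for tested, pos, untested in strata.values():
        # Draw the stratum prevalence from its Beta posterior (uniform prior) so
        # that the imputations carry the uncertainty of the imputation model.
        p = random.betavariate(pos + 1, tested - pos + 1)
        # Impute a binary HIV status for every untested person in the stratum.
        imputed_pos = sum(random.random() < p for _ in range(untested))
        positives += pos + imputed_pos
    prev = positives / n_total
    estimates.append(prev)
    variances.append(prev * (1 - prev) / n_total)  # binomial variance of a proportion

# Pool the M analyses with Rubin's rules.
pooled = mean(estimates)
total_var = mean(variances) + (1 + 1 / M) * variance(estimates)
pooled_se = math.sqrt(total_var)
print(round(pooled, 3), round(pooled_se, 4))
```

With these invented counts the complete case prevalence would be 160/3400, about 4.7%, while the pooled MI estimate is pulled toward the stratum-weighted value because the under-tested urban stratum is filled in at its own observed prevalence.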
For IPW, we first used a logistic regression model to estimate the propensity scores for participating in the sero-survey and being tested for HIV, with age, sex, residence and marital status as the covariates. The propensity scores (PS) obtained from the model balanced the distribution of observed baseline covariates between those tested for HIV and those not tested. Using the propensity scores, p(x), we derived inverse probability weights, 1/p(x), for participants who were tested for HIV.
The inverse probability weights were normalized to reflect the age, sex, residence and marital status of the HDSS population, and the HIV prevalence was estimated using the normalized inverse probability weights.
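The weighting steps can be sketched with a small stratified calculation. The counts are hypothetical, and a single covariate stands in for age, sex, residence and marital status. With one categorical covariate, a logistic propensity model reproduces the observed response rate in each stratum, so the sketch uses those rates directly. The normalization shown, which rescales the weights to sum to the number tested, is one common choice and is an assumption here, since the exact normalization is not spelled out in the text.

```python
# Hypothetical counts per stratum: (eligible, tested, HIV positive among tested).
strata = {
    "rural": (6000, 3000, 120),
    "urban": (4000, 400, 40),
}

n_tested = sum(tested for _, tested, _ in strata.values())

# Propensity of being tested, p(x), and the raw inverse probability weight 1/p(x).
propensity = {k: tested / eligible for k, (eligible, tested, _) in strata.items()}
raw_weight = {k: 1 / p for k, p in propensity.items()}

# Normalize so that the weights of the tested participants sum to the number tested.
weight_sum = sum(raw_weight[k] * tested for k, (_, tested, _) in strata.items())
weight = {k: w * n_tested / weight_sum for k, w in raw_weight.items()}

# Weighted prevalence among the tested participants.
weighted_pos = sum(weight[k] * pos for k, (_, _, pos) in strata.items())
weighted_n = sum(weight[k] * tested for k, (_, tested, _) in strata.items())
ipw_prevalence = weighted_pos / weighted_n

# Unweighted complete case prevalence for comparison.
cca_prevalence = sum(pos for _, _, pos in strata.values()) / n_tested

print(round(cca_prevalence, 4))  # 0.0471
print(round(ipw_prevalence, 4))  # 0.064
```

Tested participants from the stratum with low participation receive larger weights, so the weighted sample has the same covariate mix as the eligible population; normalization changes the scale of the weights but not the weighted prevalence.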

Results
Description of the study participants
Figure 1 shows that a total of 21857 participants aged 15 years or older were resident in the cohort: 19985 (91%) were seen in the HDSS survey and 7490 (34%) enrolled in the sero8 survey, with 5618 (26%) seen in both HDSS and sero8. The 1872 (9%) participants not in HDSS were new residents who had moved into the area after the HDSS survey. More than 70% of the eligible participants did not attend the corresponding sero-survey and hence had missing HIV status (Figure 1). A flow diagram (Figure 1) shows the enrollment of the study participants.

Study characteristics of the participants
In this population aged 15 years and above, there were 10,150 (46%) males and 11,706 (54%) females. There were 10,755 (49%) married participants, compared to 7,543 (36%) who were single and 2,829 (13%) who were separated or widowed. Overall, 11,274 (52%) participants were from rural areas and 10,578 (48%) were from urban areas. The largest group of participants, 4,752 (22%), was in the 15-19 age group, and the smallest, 779 (4%), was in the 55-59 age category. There were differences in the proportions in these categories between those who attended sero8 and those who were seen in the HDSS (Table 1).

HIV prevalence - A complete case analysis
Figure 2 shows the HIV prevalence and 95% CI estimates by sex and age group for those who attended the sero8 survey. In all age groups except the 35-39 age group, females had a higher HIV prevalence than males.
Generally, Tables 4 and 5 showed that HIV prevalence increased with age, from the youngest age group up to 35-39 years for males and 40-44 years for females, after which it started to decrease. Those who were separated or widowed had the highest HIV prevalence, and single, never-married participants had the lowest. Estimates of HIV prevalence by residence were similar across all three methods.
There was an increase in HIV prevalence estimates after adjusting for missing data using multiple imputation and inverse probability weighting. The estimates obtained using multiple imputation were slightly larger than those obtained using inverse probability weighting, and the 95% confidence intervals for MI were narrower than those obtained using IPW and CCA for both sexes. The age and sex pattern of HIV prevalence was similar for MI and IPW. The separated/widowed participants had the highest HIV prevalence. Urban residents had a higher HIV prevalence than rural residents, but the difference was not statistically significant under any of the three approaches.
In this study, females had a higher HIV prevalence than males under all three approaches. The lowest estimates were among participants aged 15-19 years, which may be because most of these participants were of school-going age, not yet married, and may not have had their sexual debut (20). Participants aged 25-59 years had high HIV prevalence, as most of them are sexually active and have multiple partners. The lower HIV prevalence among those aged 60 years and above was potentially a result of lower sexual activity in this group (21).
The separated or widowed participants had the highest HIV prevalence, as some may have had partners infected with HIV who have died or from whom they have divorced (20). Single participants had the lowest HIV prevalence under the CCA, MI and IPW methods, as many were young and not involved in sexual relationships. HIV prevalence also varied by place of residence. Urban residents had a higher HIV prevalence than rural residents, but the difference was not statistically significant (p=0.38). The non-significant difference in HIV prevalence between rural and urban residents could be explained by the fact that the entire area of Kisesa is becoming more urbanized and access to rural areas has increased considerably in recent times.
We found minor differences between the HIV prevalence estimates obtained using each of the methods, i.e. complete case analysis, multiple imputation and IPW. However, in some specific groups MI and IPW produced narrower confidence intervals. The complete case analysis method ignores the missing data and hence can underestimate the HIV prevalence. A systematic review of the analytical methods used in estimating the prevalence of HIV/AIDS from demographic and cross-sectional surveys with missing data recommended the use of advanced methods to adjust for missing data in the analysis of HIV survey data, in order to reduce bias in the estimates. Failure to adjust for missing data may result in biased estimates of the parameters of interest (22).
The HIV prevalence estimated using the methods that assumed the missingness was MAR was 2-3% higher than that from the complete case analysis, which assumed MCAR. Thus, the assumption of MCAR gave a biased estimate of the HIV prevalence, which concurs with the conclusions of a systematic review of missing data in HIV prevalence estimation (22). Our results were consistent with those of Mwambi and Chinomona, who found that the prevalence of HIV was underestimated by complete case analysis and concluded that multiple imputation provided a more accurate estimate of the HIV prevalence in the presence of missing data (20). In another analysis using multiple imputation, complete case analysis provided inefficient though valid results when missing data were MCAR, but biased results when data were MAR. The multiple imputation approach led to unbiased results with correct standard errors in situations where data were MCAR or MAR (7). A simulation study indicated that it is not advisable to use complete case analysis, especially if the proportion of missing values is high (23). With IPW, assuming no model misspecification, the prevalence estimates are corrected for the bias introduced by CCA irrespective of the sample size, as the standard errors are larger compared to IPW (24).
Multiple imputation generally gave the highest HIV prevalence estimates across most of the covariates, and its 95% confidence intervals were narrower than those of the complete case and IPW methods. This reflects the extra precision that MI introduces in the estimation process (20). The 95% confidence intervals for CCA and IPW were similar because both are restricted to the sample of those who were tested for HIV; the only difference was that, under IPW, the estimates were weighted according to the observed covariates when calculating the prevalence. In contrast, the MI method imputed data for the missing HIV status, and the extra information made the standard errors smaller, resulting in narrower, more precise 95% confidence intervals.

Conclusions
Estimating HIV prevalence from population and survey data is prone to bias when the assumptions about missing data are incorrect. Robust statistical methods have to be employed in order to properly account for missing data. Both multiple imputation and IPW are able to account for missing data.
The results of this study showed that multiple imputation (MI) is a reliable method for estimating HIV prevalence in the presence of missing data. This method was superior to the complete case and IPW approaches, as it did not underestimate HIV prevalence and had tighter 95% confidence intervals.
Therefore, we recommend the use of MI in estimating HIV prevalence to address the problem of varied types of missing data. Based on the MI estimates, overall HIV prevalence in Kisesa was 6.8%, and it was higher among females, at 7.4% (95% CI: 6.6-8.2), than among males, at 6.2% (95% CI: 5.1-7.3). Better results could have been obtained if more covariates had been used for MI or IPW.

Study Limitations
A potential limitation of this study is the use of secondary data. There are more variables that could have been used in the multiple imputation and propensity score estimation. Adding these variables would further improve the estimation of the bias introduced in the complete case analysis. Further analysis is needed in order to determine which method is best.

Consent for publication
Not applicable.

Availability of data and materials
The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request. Data will also be available in the London School of Hygiene and Tropical Medicine repository.