Study Context and Data Source
This study grew out of a broader project investigating the impact of infectious disease outbreaks (i.e., Ebola and COVID-19) on the use of health services in the DRC, using data from the national health management information system (HMIS), a DHIS2-enabled RHIS [12]. In the DRC’s HMIS, health facilities are expected to report the number of visits delivered each month on a standardized paper form. These paper forms are transferred to the health zone office, where a skilled health professional enters them into a centralized database. Each health facility is normally expected to report on a complete set of health indicators every month. In our study, we defined a missing value as the absence of a reported count for a given indicator at a given health facility in a given month.
We selected this context partly because the DRC is an interesting and challenging international setting in which to test these techniques, and partly because many members of the research team have worked closely with RHIS data there for years, which made it a convenient location for the study. We nevertheless believe that the DRC’s HMIS shares many features with RHIS in other LMICs, particularly those in Sub-Saharan Africa. Implementation of DHIS2 in the DRC began in 2014, but the system reached national scale only in 2017 [13]. From the entire sample of 18,138 facilities in the DRC, we identified 5,510 facilities that had reported every month between January 2017 and October 2020. As the COVID-19 pandemic began in the DRC in March 2020, we considered all data before this month as pre-COVID-19 and data from March 2020 onwards as after the onset of COVID-19, or during the pandemic.
Imputation Methods
For the data imputation methods evaluated, we first selected those used in past RHIS studies, i.e., exclusion, interpolation, and mean imputation [11]. We also examined other algorithms that have been used extensively to impute missing values, namely random forest, multiple imputation, k Nearest Neighbour (k-NN), and seasonal decomposition (examples by AK Waljee et al., 2013 and DJ Stekhoven & P Bühlmann, 2012) [4,14]. Mean imputation, the simplest method, also served as the baseline for comparison. Table 1 summarizes the six imputation methods examined, with a brief technical description of each. All analyses were conducted in the statistical software R [15].
Table 1: Summary of imputation methods
| Name of imputation method | Description | R package | Level of complexity to implement |
| --- | --- | --- | --- |
| 1. Mean imputation | Missing values are replaced with the average of the entire non-missing population in the same month. | N/A | Easy |
| 2. Exclusion & Interpolation | First, any facilities with three or more consecutive missing monthly reports are excluded. Next, missing values in the remaining facilities are filled by interpolation. | N/A | Easy |
| 3. Nonparametric Missing Value Imputation using Random Forest (missForest) | missForest is a relatively new Random Forest-based method that treats the variable with missing values as a dependent variable and regresses it against all the other variables in the dataset through a random forest model. This process is repeated iteratively, and in each step the missing values are filled with a better prediction. The iteration stops when a threshold is met, i.e., when the changes in the imputed values between steps become small enough. This method is popular because it handles both categorical and numerical data and requires very little manual parameter tuning [4]. | missForest (DJ Stekhoven & P Bühlmann, 2012) [14] | Moderate |
| 4. Multiple Imputation | Multiple imputation also treats the variable with missing values as a dependent variable and estimates it from the rest of the variables. The estimation is repeated M times, each with a random component that makes it slightly different, to account for the uncertainty in the missing values. The procedure returns M datasets with slightly different estimates of the missing values, and averaging across the M estimates yields an unbiased estimate of the missing values. The multiple imputation by chained equations (mice) implementation in R, in particular, enables iterative estimation of missing values in multiple variables and provides flexibility in imputing both categorical and continuous variables [16]. | mice (SV Buuren & K Groothuis-Oudshoorn, 2010) [17] | Moderate |
| 5. k Nearest Neighbour (k-NN) | For each missing data point, the k-NN algorithm finds the k non-missing observations that are most similar to it, as measured by their distances. The missing value is then filled with a weighted average of these k neighbouring observations, with weights based on their Euclidean distances to the missing data point. One difficulty with this method is the choice of k. In our study, we used the default of k=10 nearest neighbours, but k can be tuned more carefully through cross-validation [18]. | DMwR (L Torgo, 2010) [19] | Difficult: users are required to specify the parameter k. |
| 6. Seasonal Decomposition | Seasonal decomposition is tailored to handling missing values in time series data and can be summarised in three steps. First, it identifies and removes the seasonal component from the original time series. Next, imputation is performed on the deseasonalized series. Finally, the seasonal component is added back to reflect seasonality [20]. | imputeTS (S Moritz & T Bartz-Beielstein, 2017) [20] | Easy |
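The two simplest methods in Table 1 can be illustrated with a short sketch. The Python code below is illustrative only and is not the R code used in this study. It assumes that each facility has a list of monthly values with `None` marking a missing month, and it assumes linear interpolation, since the type of interpolation is not specified in Table 1. All function names are hypothetical.

```python
from statistics import mean
from typing import Dict, List, Optional

Series = List[Optional[float]]

def mean_impute(data: Dict[str, Series]) -> Dict[str, Series]:
    """Replace each missing value with the mean of all facilities reporting that month."""
    n_months = len(next(iter(data.values())))
    monthly_mean = []
    for t in range(n_months):
        observed = [s[t] for s in data.values() if s[t] is not None]
        monthly_mean.append(mean(observed) if observed else None)
    return {f: [monthly_mean[t] if v is None else v for t, v in enumerate(s)]
            for f, s in data.items()}

def longest_gap(series: Series) -> int:
    """Length of the longest run of consecutive missing months."""
    longest = run = 0
    for v in series:
        run = run + 1 if v is None else 0
        longest = max(longest, run)
    return longest

def interpolate(series: Series) -> Series:
    """Linear interpolation (assumed); edge gaps copy the nearest observed value."""
    known = [t for t, v in enumerate(series) if v is not None]
    out = list(series)
    for t, v in enumerate(series):
        if v is not None:
            continue
        before = [k for k in known if k < t]
        after = [k for k in known if k > t]
        if before and after:
            a, b = before[-1], after[0]
            out[t] = series[a] + (series[b] - series[a]) * (t - a) / (b - a)
        elif before:
            out[t] = series[before[-1]]
        elif after:
            out[t] = series[after[0]]
    return out

def exclude_and_interpolate(data: Dict[str, Series], max_gap: int = 3) -> Dict[str, Series]:
    """Drop facilities with max_gap or more consecutive missing months, then interpolate."""
    return {f: interpolate(s) for f, s in data.items() if longest_gap(s) < max_gap}

# Toy data for three hypothetical facilities over four months.
data = {
    "A": [100.0, None, 120.0, 130.0],
    "B": [80.0, 90.0, None, 70.0],
    "C": [50.0, None, None, None],
}
by_mean = mean_impute(data)
by_interp = exclude_and_interpolate(data)
```

In this toy example, facility C is dropped by the exclusion rule because it has three consecutive missing months, whereas mean imputation keeps it and fills its gaps with the monthly means of the other facilities.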
When implementing methods 3, 4, and 5, we also included leads and lags of one time unit in the imputation model, following the recommendation in [21] that a time series’ own past and future values can help predict the time point of interest.
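As an illustration of the lead and lag construction, the following Python sketch builds one-step lag and lead features for a single facility's monthly series. It is a minimal example under the assumption that missing months are stored as `None`; the function name `add_lag_lead` is hypothetical and does not come from the R code used in this study.

```python
from typing import List, Optional, Tuple

Value = Optional[float]

def add_lag_lead(series: List[Value]) -> List[Tuple[Value, Value, Value]]:
    """Return (lag1, current, lead1) for each month of one facility's series.

    The first month has no lag and the last month has no lead, so those
    entries are None.
    """
    rows = []
    for t, current in enumerate(series):
        lag1 = series[t - 1] if t > 0 else None
        lead1 = series[t + 1] if t + 1 < len(series) else None
        rows.append((lag1, current, lead1))
    return rows

# Toy series with the second month missing.
visits = [120.0, None, 135.0, 128.0]
features = add_lag_lead(visits)
```

For the missing second month, the resulting row carries the neighbouring values 120.0 and 135.0, which an imputation model can then use as predictors.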
In general, missForest and k-NN are considered machine learning algorithms because they do not require users to specify explicitly how the prediction is made, whereas multiple imputation and seasonal decomposition require users to specify a model.
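To make the distance-weighted k-NN idea concrete, the Python sketch below imputes one missing cell from its nearest neighbours. It is a simplified illustration, not the DMwR implementation: the inverse-distance weights, the handling of partially observed rows, and the function name `knn_impute` are all assumptions made for this example.

```python
import math
from typing import List, Optional, Sequence

Row = Sequence[Optional[float]]

def knn_impute(rows: List[Row], target: int, col: int, k: int = 10) -> float:
    """Impute rows[target][col] from the k nearest rows that observe column `col`.

    Distance is Euclidean over the columns observed in both rows. Weights are
    inverse distances, which is an assumed weighting scheme.
    """
    query = rows[target]
    candidates = []
    for i, row in enumerate(rows):
        if i == target or row[col] is None:
            continue
        shared = [j for j in range(len(row))
                  if j != col and row[j] is not None and query[j] is not None]
        if not shared:
            continue
        dist = math.sqrt(sum((row[j] - query[j]) ** 2 for j in shared))
        candidates.append((dist, row[col]))
    if not candidates:
        raise ValueError("no usable neighbours for this cell")
    nearest = sorted(candidates)[:k]
    # A small constant avoids division by zero for identical rows.
    weights = [1.0 / (d + 1e-9) for d, _ in nearest]
    return sum(w * v for w, (_, v) in zip(weights, nearest)) / sum(weights)

# Toy rows: two predictor columns and one column to impute.
rows = [(0.0, 0.0, None), (3.0, 4.0, 50.0), (6.0, 8.0, 80.0)]
value = knn_impute(rows, target=0, col=2, k=2)
```

Because the distance is computed on raw values, predictors on very different scales would normally be standardized first; that step is omitted here for brevity.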
Statistical Analysis
We evaluated the performance of the six imputation techniques described above using the three analytical methods most commonly applied to RHIS datasets, as identified in the systematic review [11]. These methods are:
- Simple Linear Regressions;
- Segmented Regressions, which are the recommended technique for conducting ITS studies [21] and are widely used in evaluating health system quality improvement interventions when randomization is not possible [22];
- Parametric group comparisons through paired t-tests and non-parametric comparisons through paired Wilcoxon signed-rank tests, both of which are widely used in pre-post comparison studies.
Missing Data Mechanism
Before the imputation methods can be evaluated, it is important to first understand the missingness mechanism in the dataset. Missingness mechanisms are typically classified as (1) Missing Completely At Random (MCAR), where the probability of being missing is totally random and does not depend on the value of any variables; (2) Missing At Random (MAR), where the missing values in the variable may depend on the known values of other variables in the data but not on the missing variable itself; and (3) Missing Not At Random (MNAR), where the missingness of a variable could depend on the missing variable itself [24].
If data are believed to be MNAR, it is generally recommended to improve the data quality by re-collecting data rather than using an imputation method, because the missing pattern is not observed in the dataset [10,25]. On the other hand, if the data are believed to be MCAR, i.e., the probability of a data point being missing is totally random and independent of all other variables, then a complete case analysis in which missing values are simply removed would generate unbiased results in subsequent statistical analyses [26]. If, however, the data are believed to be MAR, i.e., the missing pattern can be fully identified using the observed data, some algorithms can be applied to impute the missing values, resulting in a new complete dataset with imputed values. This new complete dataset can then be used to conduct further analyses.
To simulate a scenario where the RHIS data were missing at random, we inserted missing values into an HMIS dataset consisting of the 5,510 always-reporting facilities from the DRC’s HMIS as follows: the monthly total number of clinical visits for a given facility in a given month was set to missing depending on the facility’s location (city and province), facility type (one of Hospital, Health Post, or Health Centre), time (the number of months elapsed since January 2017), season (a four-level categorical variable: 1 for January to March, 2 for April to June, 3 for July to September, and 4 for October to December), log population, and a binary indicator of the COVID-19 pandemic (0 for January 2017 through February 2020, and 1 otherwise), through the following equation:
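One common way to generate covariate-dependent missingness of this kind is a logistic model whose intercept is calibrated to reach a target missing share. The Python sketch below illustrates that general approach only. The logistic form, the coefficients, the covariate values, and all function names are hypothetical and are not taken from this study; only a subset of the covariates listed above is used to keep the example short.

```python
import math
import random
from typing import List

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def calibrate_intercept(scores: List[float], target: float) -> float:
    """Bisection search for the intercept b0 that makes the average
    missingness probability sigmoid(b0 + score) equal to `target`."""
    lo, hi = -30.0, 30.0
    for _ in range(60):
        mid = (lo + hi) / 2
        rate = sum(sigmoid(mid + s) for s in scores) / len(scores)
        if rate < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def mar_mask(scores: List[float], target: float, seed: int = 0) -> List[bool]:
    """Draw a missingness flag for every facility-month (True means missing)."""
    b0 = calibrate_intercept(scores, target)
    rng = random.Random(seed)
    return [rng.random() < sigmoid(b0 + s) for s in scores]

# Hypothetical covariates per facility-month:
# months since January 2017, COVID-19 flag (March 2020 is month 38), log population.
cov_rng = random.Random(1)
records = [(t, int(t >= 38), cov_rng.uniform(8.0, 12.0))
           for t in range(46) for _ in range(200)]

# Hypothetical coefficients, chosen only for illustration.
scores = [0.01 * t + 0.5 * covid - 0.2 * log_pop for t, covid, log_pop in records]

mask = mar_mask(scores, target=0.15)
missing_share = sum(mask) / len(mask)
```

The missingness flag depends only on observed covariates and never on the visit count itself, which is what makes the simulated mechanism MAR rather than MNAR.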
With this formula, we produced six datasets with 5%, 10%, 15%, 20%, 25%, or 30% of the monthly visits set to missing, respectively. Next, each of the six missing value imputation methods described above was used to impute the missing values in each dataset, and imputation bias and Root Mean Square Error (RMSE) were calculated to compare the imputed values with the true observed values.
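Assuming the conventional definitions, with bias as the mean signed difference between imputed and true values and RMSE as the square root of the mean squared difference, both taken over the values that were artificially set to missing, the comparison can be sketched in Python as follows. The exact definitions used in this study may differ, for example through normalization, and the function name is hypothetical.

```python
import math
from typing import Sequence, Tuple

def bias_and_rmse(imputed: Sequence[float], truth: Sequence[float]) -> Tuple[float, float]:
    """Return (bias, RMSE) of imputed values against the true values.

    Both sequences must cover the same artificially removed facility-months,
    in the same order.
    """
    if len(imputed) != len(truth) or len(imputed) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [i - t for i, t in zip(imputed, truth)]
    bias = sum(errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return bias, rmse

# Toy example with three removed values.
bias, rmse = bias_and_rmse([110.0, 95.0, 130.0], [100.0, 100.0, 125.0])
```

A method can have a bias close to zero and still a large RMSE when over- and under-estimates cancel out, which is why the two measures are reported together.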
Consecutive Missingness
Facilities sometimes miss several monthly reports in a row, a pattern that differs from the MAR mechanism considered above. Figure 1 summarizes the number of facilities with no missing monthly reports, with missing reports but no consecutive missing reports, with exactly two consecutive missing reports, with exactly three consecutive missing reports, and with at least four consecutive missing reports, respectively. A considerable number of facilities had at least four consecutive missing reports (4,446 out of 18,138 facilities, or approximately 25%), which led us to also consider the performance of the imputation methods on datasets with consecutive missing values. Specifically, we generated two additional datasets in which 15% and 30% of facility-month reports, respectively, were randomly set to missing in consecutive blocks.
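The grouping of facilities by consecutive missingness can be sketched as below. This Python example assumes that each category is determined by the longest run of consecutive missing months in a facility's series, which is one plausible reading of the categories described above. The category labels and function names are hypothetical.

```python
from collections import Counter
from typing import List, Optional

def longest_missing_run(series: List[Optional[float]]) -> int:
    """Length of the longest run of consecutive missing (None) months."""
    longest = run = 0
    for value in series:
        run = run + 1 if value is None else 0
        longest = max(longest, run)
    return longest

def missingness_category(series: List[Optional[float]]) -> str:
    """Assign a facility to one of five categories by its longest missing run."""
    run = longest_missing_run(series)
    if run == 0:
        return "no missing reports"
    if run == 1:
        return "missing, none consecutive"
    if run == 2:
        return "two consecutive"
    if run == 3:
        return "three consecutive"
    return "four or more consecutive"

# Toy series for four hypothetical facilities.
facilities = {
    "A": [10.0, 12.0, 11.0, 13.0, 12.0, 14.0],
    "B": [10.0, None, 11.0, None, 12.0, 14.0],
    "C": [10.0, None, None, 13.0, 12.0, 14.0],
    "D": [None, None, None, None, 12.0, 14.0],
}
counts = Counter(missingness_category(s) for s in facilities.values())
```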
Subsequent Analyses
After the imputed datasets were constructed, we then performed three types of analyses on each, i.e., simple linear regression, segmented regression analysis of ITS, and parametric and non-parametric tests for comparing groups. For simple linear regression, the facility-level monthly total number of clinical visits was used as the target variable and the incidence rate ratio (IRR) was estimated, with the following explanatory variables:
- Time – a discrete variable counting the number of months elapsed since January 2017;
- COVID – a binary variable indicating the presence of COVID-19 pandemic, i.e., 0 for January 2017 through February 2020, and 1 otherwise;
- Log Population – a continuous variable capturing the log-transformed population size of the health zone where the facility is located;
- Facility type – a categorical variable specifying the type of facility. Possible values are Hospital, Health Post, or Health Centre;
- Province – a categorical variable accounting for the 26 provinces in the DRC;
- Season – a categorical variable to control for each of the 4 seasons as there are known to be seasonal variations in the use of health services in comparable settings [27].
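The time-based explanatory variables in this list can be derived directly from the calendar month of each report. The Python sketch below shows one way to do so, assuming that January 2017 is coded as month 0. The function name and the dictionary keys are hypothetical.

```python
from typing import Dict

def time_covariates(year: int, month: int) -> Dict[str, int]:
    """Derive Time, COVID, and Season for a report month (month is 1 to 12)."""
    if not 1 <= month <= 12:
        raise ValueError("month must be between 1 and 12")
    # Months elapsed since January 2017 (January 2017 is assumed to be 0).
    time = (year - 2017) * 12 + (month - 1)
    # 0 for January 2017 through February 2020, and 1 from March 2020 onwards.
    covid = 1 if (year, month) >= (2020, 3) else 0
    # 1 for January to March, 2 for April to June, 3 for July to September, 4 otherwise.
    season = (month - 1) // 3 + 1
    return {"time": time, "covid": covid, "season": season}

first = time_covariates(2017, 1)
onset = time_covariates(2020, 3)
last = time_covariates(2020, 10)
```

With this coding, the study period from January 2017 to October 2020 spans month indices 0 to 45, i.e., 46 months.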
For segmented regressions, we considered a facility-level mixed-effect segmented Poisson regression model, with the same target and explanatory variables as in the simple linear regression described above.
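One way to write such a model, using only the explanatory variables listed above, is given below. This is a sketch under stated assumptions rather than the exact specification fitted in this study: it assumes a log link, a facility-level random intercept $u_i$, and indicator vectors $\mathbf{z}_i$ (facility type and province) and $\mathbf{s}_t$ (season). Here $Y_{it}$ denotes the number of clinical visits at facility $i$ in month $t$, assumed to follow a Poisson distribution conditional on $u_i$. No separate post-onset slope term is included because none is described above.

```latex
\begin{aligned}
\log \mathbb{E}\left[Y_{it} \mid u_i\right]
  &= \beta_0 + \beta_1\,\mathrm{Time}_t + \beta_2\,\mathrm{COVID}_t
     + \beta_3\,\mathrm{LogPopulation}_i \\
  &\quad + \boldsymbol{\gamma}^{\top}\mathbf{z}_i
     + \boldsymbol{\delta}^{\top}\mathbf{s}_t + u_i,
     \qquad u_i \sim \mathcal{N}(0, \sigma_u^2), \\
\mathrm{IRR}_j &= \exp(\beta_j).
\end{aligned}
```

Under this formulation, the incidence rate ratio for the COVID indicator, $\exp(\beta_2)$, is the multiplicative change in expected monthly visits after the onset of the pandemic, holding the other variables fixed.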
For pre-post comparisons, we applied both tests to each of the imputed datasets: paired t-tests to examine whether the mean number of clinical visits differed significantly before vs. during the COVID-19 pandemic, and paired Wilcoxon signed-rank tests to examine whether there was a location shift in the median number of clinical visits between the two periods. The number of monthly clinical visits was selected as a good overall measure of health services utilization in the DRC, and this comparison was chosen because a decrease in the use of health services was found in Kinshasa, DRC following the onset of the COVID-19 pandemic [28].
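The two paired test statistics can be computed as in the following Python sketch. It assumes that the pairing is by facility, with one pre-pandemic and one pandemic-period value per facility, and it computes only the test statistics, not p-values. The paired form of the Wilcoxon test is computed here as the signed-rank statistic. Function names and the toy values are hypothetical.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple

def paired_t(before: Sequence[float], after: Sequence[float]) -> Tuple[float, int]:
    """Paired t statistic and degrees of freedom for after minus before."""
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1

def signed_rank(before: Sequence[float], after: Sequence[float]) -> Tuple[float, float]:
    """Wilcoxon signed-rank sums (W_plus, W_minus) for after minus before.

    Zero differences are dropped; tied absolute differences share their
    average rank.
    """
    diffs = [a - b for a, b in zip(after, before) if a != b]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    pos = 0
    while pos < len(order):
        end = pos
        while (end + 1 < len(order)
               and abs(diffs[order[end + 1]]) == abs(diffs[order[pos]])):
            end += 1
        avg_rank = (pos + end) / 2 + 1  # ranks are 1-based
        for i in order[pos:end + 1]:
            ranks[i] = avg_rank
        pos = end + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus

# Toy mean monthly visits for five hypothetical facilities.
before = [100.0, 120.0, 90.0, 110.0, 105.0]
after = [90.0, 115.0, 92.0, 100.0, 95.0]
t_stat, df = paired_t(before, after)
w_plus, w_minus = signed_rank(before, after)
```

In practice, a statistical package would be used to obtain p-values from these statistics; the sketch is meant only to show what each test measures.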
Stability
RHIS databases are typically updated regularly. In the DRC, for example, the datasets are updated monthly as reports arrive from health facilities, including some submitted with a delay. Because of these frequent updates, it is important that the imputed values, and the estimators subsequently obtained from them, remain consistent from month to month. We therefore tested each imputation method’s stability to minimal changes in the dataset (i.e., with only two months of data removed). In particular, we designed one scenario in which the last two months (September and October 2020) were removed and another in which two random months of data (i.e., two months chosen randomly from the entire dataset of 46 months, with a ten-fold cross-validation) were removed. We compared the performance of each imputation method on the datasets generated under these two scenarios with its performance on the original dataset to evaluate the method’s stability. In addition, we reran each imputation model on the original dataset with a different random starting point, because many machine learning optimization algorithms are starting-point dependent [29], so the choice of starting point could affect the convergence and performance of the imputation method.
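The first stability scenario can be illustrated with a small Python sketch that re-imputes a series after dropping its last two months and measures how much the imputed values change. The stability measure (mean absolute change in imputed values) and the simple series-mean imputer are assumptions made for this example, not the procedure used in this study, and all names are hypothetical.

```python
from typing import Callable, List, Optional

Series = List[Optional[float]]

def series_mean_impute(series: Series) -> List[float]:
    """Toy imputer: fill gaps with the mean of the observed values in the series."""
    observed = [v for v in series if v is not None]
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in series]

def stability_last_months(impute: Callable[[Series], List[float]],
                          series: Series, drop_last: int = 2) -> float:
    """Mean absolute change in imputed values when the last months are removed."""
    reduced_input = series[:-drop_last]
    full = impute(series)
    reduced = impute(reduced_input)
    # Compare only months that were missing and are present in both versions.
    gaps = [t for t, v in enumerate(reduced_input) if v is None]
    if not gaps:
        return 0.0
    return sum(abs(full[t] - reduced[t]) for t in gaps) / len(gaps)

# Toy series with one missing month.
series = [100.0, None, 110.0, 120.0, 130.0, 140.0]
change = stability_last_months(series_mean_impute, series)
```

A value of zero would mean that the imputed values are unaffected by the removal of the two most recent months; larger values indicate a less stable method.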