Of the 1459 unique articles retrieved from the database search, 132 studies met the inclusion criteria after full-text screening and were included in the review. The characteristics of these studies are presented in Table 1. Our review identified studies from 37 different countries. Three quarters of the studies were from Sub-Saharan African countries (74%), followed by South Asia (11%). The vast majority of the studies were published in the last decade, and more than half were published after 2014 (55%), suggesting an increase in the use of RHIS data for research purposes over time. Most of the studies included an analysis of RHIS data (97%). A few used RHIS data to inform the study but did not describe any analysis of RHIS data, such as one study that used information from RHIS data to justify the selection of the indicator for individual-level abstraction at the facility. Among the studies that analyzed RHIS data, most used an ecological study design (79%). Of those, more than half included statistical inferences (61%), while the remaining studies only used RHIS data for descriptive purposes (31%). Nearly a fifth of the studies were mixed methods or case studies (18%), a third of which included statistical analyses of RHIS data (33%). A quarter of the articles described how they managed missing data (25%), while only a small number described how they detected and dealt with extreme values (14%).
[Table 1 to be placed here.]
Types of disease and research purpose
Figure 2 shows the different research purposes for which RHIS data were used, along with the health topics investigated. The most common purpose of the studies was program evaluation (51%). RHIS data have been used to evaluate a wide range of interventions, from programs that targeted specific diseases to interventions or policies that affected multiple types of diseases or health services. These included malaria control strategies, user fee exemption policies, health financing schemes, interventions on health governance, administration of new vaccines, and community-level interventions, such as approaches to improve community participation and referrals from traditional birth attendants in order to increase the demand for maternal and child care.
Additionally, RHIS data were used to monitor or assess service provision (23%) and to describe disease epidemiology (17%). As with program evaluation, these studies investigated a diverse set of health services and the allocation of healthcare resources. Some of these studies found large discrepancies between RHIS data and the estimated disease burden in the population, and highlighted the lack of service provision. A few studies also used RHIS data to describe specific programs, conduct impact evaluations (non-programmatic), and estimate costs. Most of the studies investigated a communicable disease (95%), of which malaria was the most studied health condition (24%). A few studies focused on mental health (3), diabetes mellitus (1), and permanent tooth extraction (1). Only two studies used RHIS data to research the health workforce or the equity of funding allocations 29,30.
Analytic methods using RHIS data
Among articles that conducted statistical analyses using RHIS data (68), time series analyses to test or account for trends were most commonly performed (17), followed by geostatistical analysis (11), pre-post comparison (10), interrupted time series (7), and difference-in-difference analysis (5). Other longitudinal analyses (9), other cross-sectional analyses (8), and scenario analysis on cost effectiveness (1) were also conducted. Table 2 presents the range of methodologies identified across studies using RHIS data, as well as the corresponding articles.
[Table 2 to be placed here.]
Time series analysis
Time series analysis using RHIS data was most often applied to evaluate programs and identify disease epidemiology, with one study assessing the impact of an infectious disease outbreak on primary health service utilization 31. Studies analyzed indicators using large quantities of monthly or yearly data to estimate change (range of time units: 5 – 168). For instance, two-thirds of the studies analyzed three or more years of monthly data. Many of the studies utilized the highly disaggregated nature of the data by using either facility or district level data, with the exception of two studies which modelled national trends 32,33. Studies commonly applied strategies to account for temporal autocorrelation and the correlation between geographical units, including generalized linear models 34, multi-level analysis 35,36, and ordinary least-squares regression with adjustment for seasonality and lag 37–39. Among studies that modelled multiple facilities or administrative regions, random effects were commonly applied to account for heterogeneity.
In addition to RHIS data, a number of included studies incorporated data from external sources in their models, linked by geographical location such as district or region. Studies on malaria, for example, commonly included climate data from satellites in their models to control for important temporal factors, such as precipitation, humidity, and temperature 37,40. Other studies incorporated information from other national community surveys, health facility surveys, and program data as covariates 35,38. While most studies controlled for potential confounders by including covariates in their analytic models, one study on maternal health services applied propensity score matching to further remove biases arising from differences in covariate distribution 39.
Geostatistical analysis
Geostatistical analyses using RHIS data were predominantly conducted for epidemiological purposes and for the monitoring and assessment of service provision, exploiting the geospatial information included in the RHIS at the facility or district level. Three of the studies that applied geostatistical analysis were cross-sectional, while the remainder were spatial-temporal. About half of the studies focused on malaria, of which three compared and illustrated various kriging methods to provide a reliable estimate of malaria burden despite incomplete reporting 41–43, and one study applied geostatistical modeling to select the most relevant health facility indicators for severe malaria outcomes 44. Studies on other topics investigated the spatial or spatial-temporal dynamics of malaria in pregnancy 45, childhood diarrhea 46, clustering of malaria and HIV 47, and meningitis 48. About half of the studies did not include data from external sources, while the others triangulated RHIS data with satellite data, Demographic and Health Surveys, national Malaria Indicator Surveys, and Service Delivery Indicator Surveys in their analyses. Studies that included covariates in the geostatistical analysis applied Bayesian hierarchical Poisson models or Bayesian geostatistical negative binomial models 44,49,50.
Pre-post comparison analysis
Pre-post comparison was commonly applied among studies that used RHIS data for program evaluation, and several studies used simple descriptive statistics to compare the periods before and after interventions. As pre-post comparison is subject to the limitation of temporal confounders and secular trends, two of the studies included contextual factors in regression modelling 51,52.
Interrupted time series analysis
Most of the studies that conducted interrupted time series (ITS) analysis used it to evaluate interventions, and one assessed the impact of an infectious disease outbreak on maternal and child health service use 53. The studies used large quantities of monthly data to model trend and level change (range of time units: 44 – 132). RHIS data were minimally aggregated in these studies, which mostly analyzed facility or district level data. Similar to the studies using time series analysis, they accounted for autocorrelation by incorporating autoregressive structures or clustered standard errors in their modelling.
By design, ITS analyses are generally unaffected by confounding variables that do not change over time 54, so baseline characteristics were typically not included in these models. Nonetheless, ITS analyses can be affected by time-varying confounders that change rapidly, and some models included contextual factors from other data sources, such as climate and program data. To strengthen the quasi-experimental design, two studies also included a contrast group of time series to control for contextual changes that occurred at the same time as the interventions 55,56.
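For orientation, a commonly used single-series segmented regression form of an ITS model is sketched below. This is a generic textbook specification, not the model of any particular included study; the symbols (outcome Y_t, time index t, intervention time t_0, post-intervention indicator D_t, and first-order autoregressive errors) are assumptions introduced here for illustration.

```latex
Y_t = \beta_0 + \beta_1 t + \beta_2 D_t + \beta_3 (t - t_0) D_t + \varepsilon_t,
\qquad
D_t =
\begin{cases}
0, & t < t_0 \\
1, & t \ge t_0
\end{cases},
\qquad
\varepsilon_t = \rho\, \varepsilon_{t-1} + u_t
```

Under this specification, \beta_1 is the pre-intervention trend, \beta_2 is the level change at the time of the intervention, and \beta_3 is the change in trend afterwards. The autoregressive error term is one way to represent the autocorrelation adjustment described above; time-varying covariates such as climate data would enter as additional terms.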
Difference-in-difference analysis
Five studies applied difference-in-difference techniques using a wide range of time periods (range of time units: 4 – 48) and levels of geographical units (facility, district, provincial). Only one study included contextual characteristics from other data sources in its analysis. Analytic methods varied from descriptive comparison between and within intervention and control groups 57–60, to ordinary least squares regression with propensity score matching 61.
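The descriptive form of a difference-in-difference comparison can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of any included study: the function name `did_estimate`, the record layout, and the monthly visit counts are all hypothetical.

```python
from statistics import mean


def did_estimate(records):
    """Return the simple 2x2 difference-in-difference estimate.

    `records` is an iterable of (group, period, value) tuples, where group is
    "intervention" or "control" and period is "pre" or "post" (hypothetical
    layout chosen for this sketch).
    """
    cells = {}
    for group, period, value in records:
        cells.setdefault((group, period), []).append(value)

    required = [
        ("intervention", "pre"), ("intervention", "post"),
        ("control", "pre"), ("control", "post"),
    ]
    missing = [cell for cell in required if cell not in cells]
    if missing:
        raise ValueError(f"no observations for cells: {missing}")

    means = {cell: mean(values) for cell, values in cells.items()}
    # Change over time within each group.
    change_intervention = means[("intervention", "post")] - means[("intervention", "pre")]
    change_control = means[("control", "post")] - means[("control", "pre")]
    # The control-group change stands in for what would have happened
    # to the intervention group without the intervention.
    return change_intervention - change_control


# Hypothetical monthly facility visit counts, for illustration only.
example = [
    ("intervention", "pre", 40), ("intervention", "pre", 44),
    ("intervention", "post", 58), ("intervention", "post", 62),
    ("control", "pre", 38), ("control", "pre", 42),
    ("control", "post", 45), ("control", "post", 47),
]

print(did_estimate(example))
```

In this made-up example the intervention group rises by 18 visits per month and the control group by 6, giving an estimate of 12. The regression-based variants mentioned above extend this idea with covariates or matching, which this sketch does not attempt.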
Impact of research using RHIS data
Most of the studies that conducted statistical analyses using RHIS data were published in journals with impact factors (88%, Figure 3). Of these, two-thirds were published in journals with an impact factor of two or higher, and more than a fifth in journals with an impact factor above three. The studies published in the journals with the highest impact factors focused on program evaluation (8), monitoring and assessment of service provision (3), epidemiology (3), and impact evaluation (1). These studies encompassed a range of health topics commonly studied using RHIS data.
Strategies to circumvent RHIS data quality issues
Data quality is commonly cited as a barrier to using RHIS data in research. Slightly more than a quarter of the included studies described the strategies that they used to handle missing data and/or identify extreme values (Table 3), which included exclusion, imputation, interpolation, verification, and accounting for missing data in modeling. Exclusion of missing data was the most common practice: studies that used this technique excluded facilities from the analytic samples 55–57,60,62–69, restricted the study period based on explicit criteria 70,71, or applied sensitivity analysis to compare various exclusion criteria 60,72,73. Imputation methods varied from assigning specific values to the missing observations 48,57,61,74–76, to various modeling strategies such as conditional autoregressive models 49, generalized linear regression 75, and iterative singular value decomposition 75. Sensitivity analysis was also conducted to select a specific imputation strategy 75. Interpolation involves predicting values at unsampled locations. Methods described included the use of space-time kriging 41–43, and the adjustment of results by calibrating with other relevant information 64,77,78. Some studies assumed data were missing at random, which was accounted for in specific modeling methods such as mixed-effect models 63,75. When the source of the data could be reached, studies also described verifying the missing information against the registries where the original data were recorded 40,71,79–81.
[Table 3 to be placed here.]
Slightly fewer articles described methods to identify and handle extreme values in the RHIS data, among which three types of strategies emerged: setting specific thresholds, visual inspection, and analytic assessment. Thresholds were set based on the distribution of the data, such as proportions or standard deviations from univariate regression. Several studies used visual inspection to identify outliers 43,55, while jackknifing analysis and the identification of influential points through Cook’s distance statistics were also applied 82,83. Once extreme values were identified, several strategies were used: exclusion, replacement with the average value, replacement with a missing value, verification with a data source, or discounting the observation in statistical estimation. However, studies that replaced the extreme value with an explicit value potentially introduced bias into their estimates. A few studies also described the processes applied to assess the reliability of the RHIS data, some of which were routine processes administered in the health systems 80,81.
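As a rough illustration of the threshold-based strategy, the sketch below flags values that lie more than k standard deviations from a series mean and then recodes them as missing. It is a simplified assumption-laden example rather than the rule used by any included study: the function names, the cutoff k = 2, and the monthly counts are hypothetical, and the mean and standard deviation are taken directly from the series rather than from a regression.

```python
from statistics import mean, stdev


def flag_extreme(values, k=2.0):
    """Return indices of values more than k standard deviations from the mean.

    Missing reports are represented as None and are ignored. The cutoff k is
    an illustrative assumption, not a recommended threshold.
    """
    observed = [v for v in values if v is not None]
    if len(observed) < 3:
        return []  # too few reports to judge what is extreme
    centre = mean(observed)
    spread = stdev(observed)
    if spread == 0:
        return []  # all reports identical, nothing to flag
    return [
        i for i, v in enumerate(values)
        if v is not None and abs(v - centre) > k * spread
    ]


def set_to_missing(values, flagged):
    """Recode flagged observations as missing (None) instead of imputing them."""
    flagged = set(flagged)
    return [None if i in flagged else v for i, v in enumerate(values)]


# Hypothetical monthly case counts from one facility; the eighth value looks
# like a data-entry error (for example, an extra digit).
monthly_counts = [100, 104, 98, 101, 97, 103, 99, 1020, 102, 100, 96, 105]

flagged = flag_extreme(monthly_counts, k=2.0)
cleaned = set_to_missing(monthly_counts, flagged)
print(flagged, cleaned)
```

One limitation worth noting is that an extreme value inflates the very standard deviation used to detect it, so in short series a strict cutoff can fail to flag anything; robust alternatives (for example, median-based thresholds) avoid this but are not shown here.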