Data cleaning approaches
Data cleaning is defined as “the process used to determine inaccurate, incomplete, or unreasonable data and then improving the quality through correction of detected errors and omissions” (13). Data cleaning is essential for transforming raw data into quality data for purposes such as analysis and data mining (14). An extensive body of work exists on how to clean data (15–17). Approaches can broadly be categorized as quantitative or qualitative. Quantitative approaches employ statistical methods and are largely used to detect outliers (18–20). Qualitative techniques, on the other hand, use patterns, constraints, and rules to detect errors (21). These approaches can be applied within automated data cleaning tools such as ARKTOS, AJAX, FraQL, Potter’s Wheel and IntelliClean (17, 21, 22).
Data Cleaning Framework
While tools and approaches exist for data cleaning, there is no standard, consensus-based approach to ensure that replicable and rigorous data cleaning standards are applied to DHIS2 data, which is widely used by countries for health decision-making. Consequently, ad hoc data cleaning approaches have been employed, and implementations have often failed to explicitly disclose the systematic data cleaning strategies used and the errors identified. This makes it difficult to replicate data cleaning procedures and to ensure that all types of quality issues are systematically addressed before data are used for analysis and decision-making.
A limited number of frameworks exist to guide error detection within data sets, and these can be adapted to recommend a systematic approach for cleaning DHIS2 data. Oftentimes, specific frameworks are applied based on the data set and the aims of the cleaning exercise (23, 24). Our study’s data cleaning approach was informed by the conceptual data-cleaning framework recommended by Broeck et al. (11). This framework was chosen because it provides a deliberate and systematic data cleaning guideline that can be tailored to cleaning data extracted from DHIS2. It presents data cleaning as a three-phase process involving repeated cycles of data screening, data diagnosis, and data editing of suspected data abnormalities. Screening involves identification of missing or excess data, outliers, inconsistencies, and strange patterns (11). Diagnosis involves determination of errors, missing data, and any true extremes and true normals (11). Editing involves correcting or deleting identified errors (11). Broeck et al.’s framework has also been extensively applied and validated in various settings (25, 26).
Study Setting
This study was conducted in Kenya, a country in East Africa. Kenya adopted DHIS2 for national reporting in 2011 (4). The country has 47 administrative counties, all of which report a range of healthcare indicator data from care facilities and settings into the DHIS2 system. For the purposes of this study, we focused specifically on HIV indicator data reported within Kenya’s DHIS2 system, given that these are the most comprehensively reported indicators in the system.
Data Cleaning Process
Adapting Broeck et al.’s framework, a step-by-step approach was used during extraction and cleaning of the data from DHIS2. These steps are generic and can be replicated by others conducting robust data cleaning on DHIS2 data. The steps are outlined below:
i. Step 1 - Outline the analyses or evaluation questions: Prior to applying Broeck et al.’s conceptual framework, it is important to identify the exact evaluations or analyses to be conducted, as this helps define the scope of the data cleaning exercise.
ii. Step 2 - Description of data and study variables: This step is important for defining the needed data elements that will be used for the evaluation data set.
iii. Step 3 - Create the database: This step involves identifying the data needed and extracting data from relevant databases to generate the final data set. Oftentimes, development of this database might require combining data from different sources.
iv. Step 4 - Apply the framework for data cleaning: During this step, the three data cleaning phases (screening, diagnosis, and treatment) in Broeck et al.’s framework are applied to the data set created.
v. Step 5 - Analyze the data: This step provides a summary of the data quality issues discovered, the eliminated data after the treatment exercise, and the retained final data set on which analyses can then be done.
Application of data cleaning process: Kenya HIV indicator reporting case example
In this section, we present the application of the data cleaning sequence above, using Kenya as a case example.
Step 1: Outline the analyses or evaluation questions and goals
For this reference case, DHIS2 data had to undergo the data cleaning process before being used for an evaluation question on ‘Performance of health care facilities at meeting the mandated HIV-indicator reporting requirements by the Kenyan Ministry of Health (MOH)’. The goal was to identify the best- and poorest-performing health facilities at reporting within the country, based on the completeness and timeliness of their reports into DHIS2.
Step 2: Description Of Data And Study Variables
HIV indicator data in Kenya are reported into DHIS2 on a monthly basis using the MOH-mandated form called “MOH 731- Comprehensive HIV/AIDS Facility Reporting Form” (MOH 731). From 2011 to 2018, MOH 731 consisted of six independent reports representing six programmatic areas in which HIV indicators were reported. The six reports and the number of indicators reported in each include: (i) HIV counselling and testing (HCT) – 10 indicators; (ii) Prevention of Mother-to-Child transmission (PMTCT) – 40 indicators; (iii) Care and Treatment (CRT) – 65 indicators; (iv) Voluntary Medical Male Circumcision (VMMC) – 13 indicators; (v) Post-Exposure Prophylaxis (PEP) – 14 indicators; and (vi) Blood Safety (BS) – 3 indicators. Each facility is expected to submit between 0 and 6 reports every month based on the type(s) of services offered by that facility. The monthly due date for all reports is defined by the MOH, as is the expected number of reports per facility.
For our use case, we wanted to create a data set to determine the performance of facilities at meeting the MOH reporting requirements by evaluating completeness and timeliness of reporting. Completeness in reporting by facilities within Kenya’s DHIS2 is measured as a continuous variable ranging from 0 to 100% and identified within the system by a variable called ‘Reporting Rate’ (RR). RR is calculated automatically within DHIS2 as the percentage of the actual number of reports submitted by each facility divided by the expected number of reports from the facility (Percent RR = # submitted reports / expected # of reports * 100). It should be noted that this RR calculation only considers report submission and not the content within the reports. As such, a report may be submitted blank or with missing indicators, but will be counted as complete simply because it was submitted. At the end of each year, DHIS2 calculates the cumulative RR for the whole year. Timeliness is calculated based on whether reports were submitted by the 15th day of the reporting period, as set by the MOH. Timeliness is represented in DHIS2 as ‘Reporting Rate on Time’ (RRT), and is also calculated automatically. The RRT for a facility is measured as the percentage of the actual number of reports submitted on time by the facility divided by the expected number of reports (Percent RRT = # reports submitted on time / expected # of reports * 100).
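The two formulas above can be sketched in code. This is an illustrative re-statement of the RR and RRT definitions, not DHIS2's internal implementation; function names and the zero-expected-reports guard are our assumptions.

```python
# Illustrative sketch of the RR and RRT formulas described in the text.
# Not actual DHIS2 code; names and the division-by-zero guard are assumptions.

def reporting_rate(submitted: int, expected: int) -> float:
    """Percent RR = # submitted reports / expected # of reports * 100."""
    if expected == 0:
        return 0.0  # assumption: no expected reports yields RR of 0
    return submitted / expected * 100


def reporting_rate_on_time(on_time: int, expected: int) -> float:
    """Percent RRT = # reports submitted on time / expected # of reports * 100."""
    if expected == 0:
        return 0.0
    return on_time / expected * 100


# A facility expected to submit 12 monthly reports, submitting 11,
# of which 9 arrived by the 15th-day deadline:
rr = reporting_rate(11, 12)            # about 91.7
rrt = reporting_rate_on_time(9, 12)    # 75.0
```

Note that, as in DHIS2, these rates count report submission only; a blank report still increments `submitted`.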
Step 3: Create The Database
After obtaining Institutional Review Board (IRB) approval for this work, we set out to create our database from three data sources as outlined below:
(1) Data Extracted from DHIS2: We extracted variables from DHIS2 for all HIV reports submitted from all facilities in all 47 counties in Kenya between the years 2011 and 2018, with variables grouped by year. Variables extracted from DHIS2 by year included: facility, programmatic area or report (e.g. Blood Safety), expected number of reports, actual number of submitted reports, actual number of reports submitted on time, cumulative Reporting Rate (RR) by year (calculated automatically by DHIS2) and cumulative RRT by year (calculated automatically by DHIS2). We also extracted the individual indicator data submitted within each report by the health facilities for all the six programmatic areas for every year under evaluation.
The number of registered facilities within DHIS2 increased from 2011 to 2018. In addition, extracting the above data from 2011 to 2018 resulted in repeated occurrences of the facility variable across the different years. For example, a facility registered in DHIS2 in 2011 appears in all subsequent years, resulting in eight occurrences within the 8 years (2011 to 2018). In this study, these repeated occurrences of facilities are referred to as ‘reporting instances’.
(2) Facility Information: We augmented the DHIS2 data with detailed facility information derived from the Kenya Master Facility List (KMFL). This information included facility level (II–VI), facility type (such as dispensary, health center, medical clinic) and facility ownership (such as private practice, MOH-owned, or owned by a non-governmental organization).
(3) Electronic Medical Record Status: We used the Kenya Health Information Systems (KeHIMS) list, which contains electronic medical records (EMR) implemented in health facilities in Kenya, to incorporate information on whether each facility had an EMR or not. Information from these three sources was merged into a single data set as outlined in Fig. 1.
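A minimal sketch of the three-source merge is shown below. All field names and example values are fabricated for illustration; the actual merge keys and structures in our database differ.

```python
# Sketch of merging the three sources into one record per reporting instance:
# DHIS2 extract, KMFL facility attributes, and KeHIMS EMR status.
# All field names and values here are illustrative assumptions.

dhis2 = [
    {"facility": "Facility A", "year": 2016, "RR": 90.0, "RRT": 80.0},
    {"facility": "Facility B", "year": 2016, "RR": 0.0, "RRT": 0.0},
]
kmfl = {
    "Facility A": {"level": "III", "type": "health center", "ownership": "MOH"},
    "Facility B": {"level": "II", "type": "dispensary", "ownership": "private"},
}
kehims = {"Facility A"}  # facilities known to run an EMR

merged = []
for row in dhis2:
    rec = dict(row)
    rec.update(kmfl.get(row["facility"], {}))        # attach KMFL attributes
    rec["has_emr"] = row["facility"] in kehims       # attach EMR status flag
    merged.append(rec)
```

In practice a join on a stable facility identifier (rather than name) would be preferable, since facility names can differ across the three lists.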
Step 4: Application Of The Framework For Data Cleaning
Figure 2 outlines the iterative cleaning process we applied, adapting Broeck et al.’s framework. Data cleaning involved repeated cycles of screening, diagnosis, and treatment of suspected data abnormalities, with each cycle resulting in a new data set.
A) Screening Phase
During the screening phase, five types of oddities need to be distinguished, namely: lack or excess of data; outliers (including inconsistencies); erroneous inliers; strange patterns in distributions; and unexpected analysis results (11). For error determination, we used RR and RRT as key evaluation variables. RR by itself only gives a sense of the proportion of expected reports submitted, but does not evaluate whether the expected indicators are included within each report. To evaluate completion of each indicator within the reports that were submitted, we created a new variable named ‘Cumulative Percent Completion’ (CPC). CPC provides an aggregate annual summary of the proportion of expected indicator values that are completed within submitted reports.
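The CPC definition above can be expressed as a simple ratio. The sketch below is our illustrative reading of that definition; the function name and the guard for zero expected values are assumptions.

```python
# Illustrative sketch of Cumulative Percent Completion (CPC): the share of
# expected indicator values actually filled in across a facility's submitted
# reports for the year. Names and the zero guard are assumptions.

def cumulative_percent_completion(filled_values: int, expected_values: int) -> float:
    """CPC = filled indicator values / expected indicator values * 100."""
    if expected_values == 0:
        return 0.0
    return filled_values / expected_values * 100


# e.g. a facility submitting 12 monthly HCT reports (10 indicators each),
# with 96 of the 120 expected indicator values filled in:
cumulative_percent_completion(96, 120)  # 80.0
```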
B) Diagnostic Phase
The diagnostic phase enables clarification of the true nature of worrisome data points, patterns, and statistics. Broeck et al. posit possible diagnoses for each data point as: erroneous, true extreme, true normal, or idiopathic (no diagnosis found, but the data still suspected of having errors) (11). We used a combination of RR, RRT, and CPC to categorize the various situations (errors or no errors) for every year a facility reported into DHIS2 (Table 1). In this table, “0” represents a situation where the percentage is zero; “X” represents a situation where the percentage is above zero; and “>100%” represents a situation where the percentage is more than 100. Based on the values of each of the three variables, it was possible to diagnose the various issues within DHIS2 (Diagnosis column).
Table 1
Categorization of the various situations within DHIS2 and actions taken
Situation | CPCa | RRb | RRTc | Diagnosis | Action |
A | 0 | 0 | 0 | Nothing was reported by facilities during this period, signifying that the facility does not report to DHIS2. This could be a true normal. | Excluded |
B | 0 | X | X | Submitted reports might be on time, but are empty. Can result from programs wanting to have full MOH731 submission even though they do not offer services in all the 6 programmatic areas– hence submitting empty reports from non-required programmatic areas. (Report is useless to decision-maker as it is empty) | Excluded |
C | 0 | X | 0 | Submitted reports are empty and not on time. (Report is useless to decision-maker as it is empty and not on time) | Excluded |
D | X | 0 | 0 | No values present for RR and RRT. However, the reports are not empty | Excluded |
E | X | > 100% | X | Over-reporting (outliers) | Excluded |
F | X | > 100% | > 100% | Over-reporting (outliers) | Excluded |
G | X | X | X | Reports submitted on time with relevant indicators included. Ideal situation | Included |
H | X | X | 0 | Submitted reports with data elements in them, but not submitted in a timely manner | Included |
aCPC – Cumulative Percent Completion
bRR – Reporting Rate
cRRT – Reporting Rate on Time
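The diagnostic rules of Table 1 can be written as a lookup function. This is a sketch of the table's logic, not code used in the study; the return convention (situation label plus an include/exclude flag) is our assumption.

```python
# Sketch of the Table 1 diagnostic rules. Returns the situation label and
# whether the reporting instance is retained (True = included).
# "0" means zero, "X" means above zero, ">100%" means above 100.

def diagnose(cpc: float, rr: float, rrt: float):
    if cpc == 0 and rr == 0 and rrt == 0:
        return "A", False  # nothing reported; possibly a true normal
    if cpc == 0 and rr > 0 and rrt > 0:
        return "B", False  # reports submitted, possibly on time, but empty
    if cpc == 0 and rr > 0 and rrt == 0:
        return "C", False  # empty reports, not on time
    if cpc > 0 and rr == 0 and rrt == 0:
        return "D", False  # data present but no RR/RRT values
    if cpc > 0 and rr > 100 and rrt > 100:
        return "F", False  # over-reporting on both rates (outlier)
    if cpc > 0 and rr > 100:
        return "E", False  # over-reporting (outlier)
    if cpc > 0 and rr > 0 and rrt > 0:
        return "G", True   # ideal: complete reports submitted on time
    if cpc > 0 and rr > 0 and rrt == 0:
        return "H", True   # complete reports, but not timely
    return "idiopathic", False  # no diagnosis found; still suspect
```

Combinations not listed in Table 1 fall through to an "idiopathic" catch-all, mirroring Broeck et al.'s fourth diagnosis category.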
For each type of report (e.g. HCT) we categorized facilities by year and variable. All health facilities with an average CPC, RR, and RRT of zero (0) across all reports were identified as not having reported for the year and were excluded, as demonstrated by the examples of Facility A and Facility B in Table 2.
Table 2
Sectional illustration of first data set
Year | Organisation unit | CPC-HCT | RR-HCT | RRT-HCT | CPC-BS | RR-BS | RRT-BS | … | AVG-CPC | AVG-RR | AVG-RRT |
2016 | Facility A | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2016 | Facility B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2017 | Facility C | 10 | 90 | 80 | 100 | 90 | 80 | 0 | 50 | 60 | 50 |
Beyond categorization of the various situations by report type, facility, and year as defined above, duplicate-related errors were also identified under two scenarios. In the first scenario, health facilities had identical attributes (such as year, name, and county) but different values for RR and RRT. In the second scenario, health facilities had identical attributes (such as year, name, and county) and identical values for RR and RRT.
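One way to separate the two duplicate scenarios is to group records on the shared attributes and check whether the RR/RRT values within each group agree. The records and field names below are fabricated for illustration.

```python
# Sketch of the two duplicate scenarios: records sharing (year, name, county)
# with differing RR/RRT (scenario 1) or identical RR/RRT (scenario 2).
# Records below are fabricated placeholders.
from collections import defaultdict

records = [
    {"year": 2016, "name": "Facility C", "county": "Nairobi", "RR": 90, "RRT": 80},
    {"year": 2016, "name": "Facility C", "county": "Nairobi", "RR": 70, "RRT": 60},
    {"year": 2017, "name": "Facility D", "county": "Kisumu", "RR": 50, "RRT": 40},
    {"year": 2017, "name": "Facility D", "county": "Kisumu", "RR": 50, "RRT": 40},
]

groups = defaultdict(list)
for rec in records:
    groups[(rec["year"], rec["name"], rec["county"])].append(rec)

scenario1, scenario2 = [], []  # keys of duplicate groups, by scenario
for key, recs in groups.items():
    if len(recs) > 1:
        values = {(r["RR"], r["RRT"]) for r in recs}
        # one distinct (RR, RRT) pair means exact duplicates (scenario 2)
        (scenario2 if len(values) == 1 else scenario1).append(key)
```

Both scenarios were excluded in the treatment phase, so in practice only the group keys (not the conflicting values) need to be retained.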
C) Treatment Phase
This is the final stage after screening and diagnosis, and entails deciding how to act on the problematic instances identified. Broeck et al. limit the possible actions to correcting, deleting, or leaving data unchanged. Based on the diagnoses illustrated in Table 1, reports in situations A–F were deleted and hence excluded from the study. Duplicates identified in the scenarios mentioned above were also excluded. Thus, only reports in situations G and H were considered suitable for the final clean data set.
Step 5: Data Analysis
The data were then disaggregated into six individual data sets, one per programmatic area, each containing the facility and year. Disaggregation was necessary because facilities offer different services and do not necessarily report in all the programmatic areas. SPSS was used to analyze the data using frequency distributions and cross tabulations in order to screen for duplication and outliers. Individual health facilities with more than eight reporting instances for a specific report type (data set) were identified as duplicates, since the maximum number of reporting instances for an individual health facility is eight, given that data were extracted over an eight-year period. From the cross tabulations, RR and RRT percentages above 100% were identified as outliers.
After the multiple iterations of data cleaning as per Fig. 2, where erroneous data were removed by situation type (identified in Table 1), a final clean data set was available and brought forward for use in answering the evaluation question. At the end of the data cleaning exercise, we determined the percentage distribution of the various situation types in the final data set. Using this analysis and the descriptions from Table 1, we selected situations with errors (D) and empty reports (B) in order to determine whether there was a difference in the distribution of facilities within these situations across the six programmatic areas. As such, only data sets disaggregated into the six programmatic areas were included in the analysis.
This enabled comparison of the distribution of facilities (submitting reports in each programmatic area) categorized by error and empty reports. Because the data contain related samples and are not normally distributed, a Friedman analysis of variance (ANOVA) was conducted to examine whether there was a difference in the distribution of facilities by programmatic area across all years (N = 8; 2011 to 2018) for the selected situation types. The dependent variable was the distribution of facilities, measured across all six programmatic areas over the eight years and categorized by situation type. Wilcoxon Signed Rank Tests were carried out as post hoc tests to compare differences in facility distribution between programmatic areas.
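To make the test setup concrete: each year is a block (n = 8) and each programmatic area is a treatment (k = 6), with within-year ranks of facility counts feeding the Friedman statistic. The sketch below computes that statistic from first principles as an illustration; the analysis itself was done in SPSS, and the data passed in here would be the actual facility counts, not the toy values in the test.

```python
# Illustrative computation of the Friedman chi-square statistic:
# Q = 12 / (n*k*(k+1)) * sum(R_j^2) - 3*n*(k+1),
# where R_j is the sum of within-block ranks for treatment j.
# A sketch for exposition only; the study's analysis was run in SPSS.

def friedman_statistic(blocks):
    """blocks: list of n blocks (years), each with k measurements (areas)."""
    n, k = len(blocks), len(blocks[0])
    rank_sums = [0.0] * k
    for block in blocks:
        # assign 1-based average ranks within the block, handling ties
        order = sorted(range(k), key=lambda j: block[j])
        ranks = [0.0] * k
        i = 0
        while i < k:
            j = i
            while j + 1 < k and block[order[j + 1]] == block[order[i]]:
                j += 1  # extend the tie group
            avg = (i + j) / 2 + 1
            for m in range(i, j + 1):
                ranks[order[m]] = avg
            i = j + 1
        for j in range(k):
            rank_sums[j] += ranks[j]
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) \
        - 3 * n * (k + 1)
```

In practice, `scipy.stats.friedmanchisquare` and `scipy.stats.wilcoxon` provide the omnibus and post hoc tests, respectively, including p-values.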
Below, we report on findings from the iterative data cleaning exercise and the resulting clean data set used to answer the evaluation question on performance of health care facilities at meeting the mandated HIV-indicator reporting requirements. The results further illustrate the value of the data cleaning exercise.