Data cleaning approaches
Data cleaning is defined as “the process used to determine inaccurate, incomplete, or unreasonable data and then improving the quality through correction of detected errors and omissions” (13). Data cleaning is essential for transforming raw data into quality data for purposes such as analysis and data mining (14). An extensive body of work exists on how to clean data (15–17). Approaches can broadly be categorized as quantitative or qualitative. Quantitative approaches employ statistical methods and are largely used to detect outliers (18–20). Qualitative techniques, on the other hand, use patterns, constraints, and rules to detect errors (21). These approaches can be applied within automated data cleaning tools such as ARKTOS, AJAX, FraQL, Potter’s Wheel and IntelliClean (17, 21, 22).
Data Cleaning Framework
While tools and approaches exist for data cleaning, there is no standard, consensus-based approach to ensure that replicable and rigorous data cleaning standards are applied to DHIS2 data, which is widely used by countries for health decision-making. Consequently, ad hoc data cleaning approaches have been employed, and implementations have often failed to explicitly disclose the systematic data cleaning strategies used and the errors identified. This makes it difficult to replicate data cleaning procedures and to ensure that all types of quality issues are systematically addressed before data are used for analysis and decision-making.
A limited number of frameworks exist to guide error detection within data sets, and these can be adapted to recommend a systematic approach for cleaning DHIS2 data. Oftentimes, specific frameworks are applied based on the data set and the aims of the cleaning exercise (23, 24). Our study’s data cleaning approach was informed by the conceptual data-cleaning framework recommended by Broeck et al. (11). This framework was chosen because it provides a deliberate and systematic data cleaning guideline that can be tailored to cleaning data extracted from DHIS2. It presents data cleaning as a three-phase process involving repeated cycles of data screening, data diagnosis, and data editing of suspected data abnormalities. Screening involves identification of missing or excess data, outliers, inconsistencies, and strange patterns (11). Diagnosis involves determination of errors, missing data, and any true extremes and true normals (11). Editing involves correcting or deleting identified errors (11). Broeck et al.’s framework has also been extensively applied and validated in various settings (25, 26).
Study Setting
This study was conducted in Kenya, a country in East Africa. Kenya adopted DHIS2 for national reporting in 2011 (4). The country has 47 administrative counties, all of which report a range of healthcare indicator data from care facilities and settings into the DHIS2 system. For the purposes of this study, we focused specifically on HIV indicator data reported within Kenya’s DHIS2 system, given that these are the most comprehensively reported indicators in the system.
Data Cleaning Process
Adapting Broeck et al.’s framework, a step-by-step approach was used during extraction and cleaning of the data from DHIS2. These steps are generic and can be replicated by others conducting robust data cleaning on DHIS2 data. The steps are outlined below:
i. Step 1 - Outline the analyses or evaluation questions: Prior to applying Broeck et al.’s conceptual framework, it is important to identify the exact evaluations or analyses to be conducted, as this helps define the scope of the data cleaning exercise.
ii. Step 2 - Description of data and study variables: This step is important for defining the needed data elements that will be used for the evaluation data set.
iii. Step 3 - Create the database: This step involves identifying the data needed and extracting data from relevant databases to generate the final data set. Oftentimes, development of this database might require combining data from different sources.
iv. Step 4 - Apply the framework for data cleaning: During this step, the three data cleaning phases (screening, diagnosis, and treatment) in Broeck et al.’s framework are applied to the data set created.
v. Step 5 - Analyze the data: This step provides a summary of the data quality issues discovered, the eliminated data after the treatment exercise, and the retained final data set on which analyses can then be done.
Application of data cleaning process: Kenya HIV indicator reporting case example
In this section, we present the application of the data cleaning sequence above, using Kenya as a case example.
Step 1: Outline the analyses or evaluation questions and goals
For this reference case, DHIS2 data had to undergo the data cleaning process before being used for an evaluation question on ‘Performance of health care facilities at meeting the mandated HIV-indicator reporting requirements by the Kenyan Ministry of Health (MOH)’. The goal was to identify the best- and poorest-performing health facilities at reporting within the country, based on the completeness and timeliness of their reports into DHIS2.
Step 2: Description Of Data And Study Variables
HIV indicator data in Kenya are reported into DHIS2 on a monthly basis using the MOH-mandated form called “MOH 731- Comprehensive HIV/AIDS Facility Reporting Form” (MOH 731). From 2011 to 2018, MOH 731 consisted of six independent reports representing six programmatic areas in which HIV indicators were reported. The six reports and the number of indicators reported in each include: (i) HIV counselling and testing (HCT) – 10 indicators; (ii) Prevention of Mother-to-Child transmission (PMTCT) – 40 indicators; (iii) Care and Treatment (CRT) – 65 indicators; (iv) Voluntary Medical Male Circumcision (VMMC) – 13 indicators; (v) Post-Exposure Prophylaxis (PEP) – 14 indicators; and (vi) Blood Safety (BS) – 3 indicators. Each facility is expected to submit between 0 and 6 reports every month based on the type(s) of services offered by that facility. The monthly due date for all reports is defined by the MOH, as is the expected number of reports per facility.
For our use case, we wanted to create a data set to determine the performance of facilities at meeting the MOH reporting requirements by evaluating completeness and timeliness of reporting. Completeness in reporting by facilities within Kenya’s DHIS2 is measured as a continuous variable ranging from 0 to 100% and identified within the system by a variable called ‘Reporting Rate’ (RR). RR is calculated automatically within DHIS2 as the percentage of the actual number of reports submitted by each facility divided by the expected number of reports from the facility (Percent RR = # submitted reports / expected # of reports * 100). It should be noted that this RR calculation only considers report submission and not the content within the reports. As such, a report may be submitted blank or with missing indicators, but will be counted as complete simply because it was submitted. At the end of each year, DHIS2 calculates the cumulative RR for the whole year. Timeliness is calculated based on whether reports were submitted by the 15th day of the reporting period, as set by the MOH. Timeliness is represented in DHIS2 as ‘Reporting Rate on Time’ (RRT), and is also calculated automatically. The RRT for a facility is measured as the percentage of the actual number of reports submitted on time by the facility divided by the expected number of reports (Percent RRT = # reports submitted on time / expected # of reports * 100).
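The two formulas above can be sketched in code. This is an illustrative re-statement of the RR and RRT definitions, not DHIS2's internal implementation; function names and the zero-expected-reports guard are our assumptions.

```python
# Illustrative sketch of the RR and RRT formulas described in the text.
# Not actual DHIS2 code; names and the division-by-zero guard are assumptions.

def reporting_rate(submitted: int, expected: int) -> float:
    """Percent RR = # submitted reports / expected # of reports * 100."""
    if expected == 0:
        return 0.0  # assumption: no expected reports yields RR of 0
    return submitted / expected * 100


def reporting_rate_on_time(on_time: int, expected: int) -> float:
    """Percent RRT = # reports submitted on time / expected # of reports * 100."""
    if expected == 0:
        return 0.0
    return on_time / expected * 100


# A facility expected to submit 12 monthly reports, submitting 11,
# of which 9 arrived by the 15th-day deadline:
rr = reporting_rate(11, 12)            # about 91.7
rrt = reporting_rate_on_time(9, 12)    # 75.0
```

Note that, as in DHIS2, these rates count report submission only; a blank report still increments `submitted`.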
Step 3: Create The Database
After obtaining Institutional Review Board (IRB) approval for this work, we set out to create our database from three data sources as outlined below:
(1) Data Extracted from DHIS2: We extracted variables from DHIS2 for all HIV reports submitted from all facilities in all 47 counties in Kenya between the years 2011 and 2018, with variables grouped by year. Variables extracted from DHIS2 by year included: facility, programmatic area or report (e.g. Blood Safety), expected number of reports, actual number of submitted reports, actual number of reports submitted on time, cumulative Reporting Rate (RR) by year (calculated automatically by DHIS2) and cumulative RRT by year (calculated automatically by DHIS2). We also extracted the individual indicator data submitted within each report by the health facilities for all the six programmatic areas for every year under evaluation.
The number of registered facilities within DHIS2 increased from 2011 to 2018. In addition, extracting the above data from 2011 to 2018 resulted in repeated occurrences of the facility variable across the different years. For example, a facility registered in DHIS2 in 2011 appears in all subsequent years, resulting in eight occurrences within the 8 years (2011 to 2018). In this study, these repeated occurrences of facilities are referred to as ‘reporting instances’.
(2) Facility Information: We augmented the DHIS2 data with detailed facility information derived from the Kenya Master Facility List (KMFL). This information included facility level (II–VI), facility type (such as dispensary, health center, medical clinic) and facility ownership (such as private practice, MOH-owned, or owned by a non-governmental organization).
(3) Electronic Medical Record Status: We used the Kenya Health Information Systems (KeHIMS) list, which contains electronic medical records (EMR) implemented in health facilities in Kenya, to incorporate information on whether each facility had an EMR or not. Information from these three sources was merged into a single data set as outlined in Fig. 1.
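A minimal sketch of the three-source merge is shown below. All field names and example values are fabricated for illustration; the actual merge keys and structures in our database differ.

```python
# Sketch of merging the three sources into one record per reporting instance:
# DHIS2 extract, KMFL facility attributes, and KeHIMS EMR status.
# All field names and values here are illustrative assumptions.

dhis2 = [
    {"facility": "Facility A", "year": 2016, "RR": 90.0, "RRT": 80.0},
    {"facility": "Facility B", "year": 2016, "RR": 0.0, "RRT": 0.0},
]
kmfl = {
    "Facility A": {"level": "III", "type": "health center", "ownership": "MOH"},
    "Facility B": {"level": "II", "type": "dispensary", "ownership": "private"},
}
kehims = {"Facility A"}  # facilities known to run an EMR

merged = []
for row in dhis2:
    rec = dict(row)
    rec.update(kmfl.get(row["facility"], {}))        # attach KMFL attributes
    rec["has_emr"] = row["facility"] in kehims       # attach EMR status flag
    merged.append(rec)
```

In practice a join on a stable facility identifier (rather than name) would be preferable, since facility names can differ across the three lists.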
Step 4: Application Of The Framework For Data Cleaning
Figure 2 outlines the iterative cleaning process we applied, adapting Broeck et al.’s framework. Data cleaning involved repeated cycles of screening, diagnosis, and treatment of suspected data abnormalities, with each cycle resulting in a new data set.
A) Screening Phase
During the screening phase, five types of oddities need to be distinguished, namely: lack or excess of data; outliers (including inconsistencies); erroneous inliers; strange patterns in distributions; and unexpected analysis results (11). For error determination, we used RR and RRT as key evaluation variables. RR by itself only gives a sense of the proportion of expected reports submitted, but does not evaluate whether the expected indicators are included within each report. To evaluate completion of each indicator within the reports that were submitted, we created a new variable named ‘Cumulative Percent Completion’ (CPC). CPC provides an aggregate annual summary of the proportion of expected indicator values that are completed within submitted reports.
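The CPC definition above can be expressed as a simple ratio. The sketch below is our illustrative reading of that definition; the function name and the guard for zero expected values are assumptions.

```python
# Illustrative sketch of Cumulative Percent Completion (CPC): the share of
# expected indicator values actually filled in across a facility's submitted
# reports for the year. Names and the zero guard are assumptions.

def cumulative_percent_completion(filled_values: int, expected_values: int) -> float:
    """CPC = filled indicator values / expected indicator values * 100."""
    if expected_values == 0:
        return 0.0
    return filled_values / expected_values * 100


# e.g. a facility submitting 12 monthly HCT reports (10 indicators each),
# with 96 of the 120 expected indicator values filled in:
cumulative_percent_completion(96, 120)  # 80.0
```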
B) Diagnostic Phase
The diagnostic phase enables clarification of the true nature of worrisome data points, patterns, and statistics. Broeck et al. posit possible diagnoses for each data point as: erroneous, true extreme, true normal, or idiopathic (no diagnosis found, but the data still suspected of having errors) (11). We used a combination of RR, RRT, and CPC to categorize the various situations (errors or no errors) for every year a facility reported into DHIS2 (Table 1). In this table, “0” represents a situation where the percentage is zero; “X” represents a situation where the percentage is above zero; and “>100%” represents a situation where the percentage is more than 100. Based on the values of each of the three variables, it was possible to diagnose the various issues within DHIS2 (Diagnosis column).
Table 1
Categorization of the various situations within DHIS2 and actions taken
Situation | CPCa | RRb | RRTc | Diagnosis | Action |
A | 0 | 0 | 0 | Nothing was reported by facilities during this period, signifying that the facility does not report to DHIS2. This could be a true normal. | Excluded |
B | 0 | X | X | Submitted reports might be on time, but are empty. Can result from programs wanting to have full MOH731 submission even though they do not offer services in all the 6 programmatic areas– hence submitting empty reports from non-required programmatic areas. (Report is useless to decision-maker as it is empty) | Excluded |
C | 0 | X | 0 | Submitted reports are empty and not on time. (Report is useless to decision-maker as it is empty and not on time) | Excluded |
D | X | 0 | 0 | No values present for RR and RRT. However, the reports are not empty | Excluded |
E | X | > 100% | X | Over-reporting (outliers) | Excluded |
F | X | > 100% | > 100% | Over-reporting (outliers) | Excluded |
G | X | X | X | Reports submitted on time with relevant indicators included. Ideal situation | Included |
H | X | X | 0 | Submitted reports with data elements in them, but not submitted in a timely manner | Included |
aCPC – Cumulative Percent Completion
bRR – Reporting Rate
cRRT – Reporting Rate on Time
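The diagnostic rules of Table 1 can be written as a lookup function. This is a sketch of the table's logic, not code used in the study; the return convention (situation label plus an include/exclude flag) is our assumption.

```python
# Sketch of the Table 1 diagnostic rules. Returns the situation label and
# whether the reporting instance is retained (True = included).
# "0" means zero, "X" means above zero, ">100%" means above 100.

def diagnose(cpc: float, rr: float, rrt: float):
    if cpc == 0 and rr == 0 and rrt == 0:
        return "A", False  # nothing reported; possibly a true normal
    if cpc == 0 and rr > 0 and rrt > 0:
        return "B", False  # reports submitted, possibly on time, but empty
    if cpc == 0 and rr > 0 and rrt == 0:
        return "C", False  # empty reports, not on time
    if cpc > 0 and rr == 0 and rrt == 0:
        return "D", False  # data present but no RR/RRT values
    if cpc > 0 and rr > 100 and rrt > 100:
        return "F", False  # over-reporting on both rates (outlier)
    if cpc > 0 and rr > 100:
        return "E", False  # over-reporting (outlier)
    if cpc > 0 and rr > 0 and rrt > 0:
        return "G", True   # ideal: complete reports submitted on time
    if cpc > 0 and rr > 0 and rrt == 0:
        return "H", True   # complete reports, but not timely
    return "idiopathic", False  # no diagnosis found; still suspect
```

Combinations not listed in Table 1 fall through to an "idiopathic" catch-all, mirroring Broeck et al.'s fourth diagnosis category.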
For each type of report (e.g. HCT) we categorized facilities by year and variable. All health facilities with an average CPC, RR, and RRT of zero (0) across all reports were identified as not having reported for the year and were excluded, as demonstrated by the examples of Facility A and Facility B in Table 2.
Table 2
Sectional illustration of first data set
Year | Organisation unit | CPC-HCT | RR-HCT | RRT-HCT | CPC-BS | RR-BS | RRT-BS | … | AVG-CPC | AVG-RR | AVG-RRT |
2016 | Facility A | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2016 | Facility B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2017 | Facility C | 10 | 90 | 80 | 100 | 90 | 80 | 0 | 50 | 60 | 50 |
Beyond categorization of the various situations by report type, facility, and year as defined above, duplicate-related errors were also identified under two scenarios. In the first scenario, health facilities had identical attributes (such as year, name, and county) but different values for RR and RRT. In the second scenario, health facilities had identical attributes (such as year, name, and county) and identical values for RR and RRT.
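One way to separate the two duplicate scenarios is to group records on the shared attributes and check whether the RR/RRT values within each group agree. The records and field names below are fabricated for illustration.

```python
# Sketch of the two duplicate scenarios: records sharing (year, name, county)
# with differing RR/RRT (scenario 1) or identical RR/RRT (scenario 2).
# Records below are fabricated placeholders.
from collections import defaultdict

records = [
    {"year": 2016, "name": "Facility C", "county": "Nairobi", "RR": 90, "RRT": 80},
    {"year": 2016, "name": "Facility C", "county": "Nairobi", "RR": 70, "RRT": 60},
    {"year": 2017, "name": "Facility D", "county": "Kisumu", "RR": 50, "RRT": 40},
    {"year": 2017, "name": "Facility D", "county": "Kisumu", "RR": 50, "RRT": 40},
]

groups = defaultdict(list)
for rec in records:
    groups[(rec["year"], rec["name"], rec["county"])].append(rec)

scenario1, scenario2 = [], []  # keys of duplicate groups, by scenario
for key, recs in groups.items():
    if len(recs) > 1:
        values = {(r["RR"], r["RRT"]) for r in recs}
        # one distinct (RR, RRT) pair means exact duplicates (scenario 2)
        (scenario2 if len(values) == 1 else scenario1).append(key)
```

Both scenarios were excluded in the treatment phase, so in practice only the group keys (not the conflicting values) need to be retained.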
C) Treatment Phase
This is the final stage after screening and diagnosis, and entails deciding how to act on the problematic instances identified. Broeck et al. limit the possible actions to correcting, deleting, or leaving data unchanged. Based on the diagnoses illustrated in Table 1, reports in situations A–F were deleted and hence excluded from the study. Duplicates identified in the scenarios mentioned above were also excluded. Thus, only reports in situations G and H were considered suitable for the final clean data set.
Step 5: Data Analysis
The data were then disaggregated into six individual data sets, one per programmatic area, each containing the facility and year. Disaggregation was necessary because facilities offer different services and do not necessarily report in all the programmatic areas. SPSS was used to analyze the data using frequency distributions and cross tabulations in order to screen for duplication and outliers. Individual health facilities with more than eight reporting instances for a specific report type (data set) were identified as duplicates, since the maximum number of reporting instances for an individual health facility is eight, given that data were extracted over an eight-year period. From the cross tabulations, RR and RRT percentages above 100% were identified as outliers.
After the multiple iterations of data cleaning as per Fig. 2, where erroneous data were removed by situation type (identified in Table 1), a final clean data set was available and brought forward for use in answering the evaluation question. At the end of the data cleaning exercise, we determined the percentage distribution of the various situation types in the final data set. Using this analysis and the descriptions from Table 1, we selected situations with errors (D) and empty reports (B) in order to determine whether there was a difference in the distribution of facilities within these situations across the six programmatic areas. As such, only data sets disaggregated into the six programmatic areas were included in the analysis.
This enabled comparison of the distribution of facilities (submitting reports in each programmatic area) categorized by error and empty reports. Because the data contain related samples and are not normally distributed, a Friedman analysis of variance (ANOVA) was conducted to examine whether there was a difference in the distribution of facilities by programmatic area across all years (N = 8; 2011 to 2018) for the selected situation types. The dependent variable was the distribution of facilities, measured across all six programmatic areas over the eight years and categorized by situation type. Wilcoxon Signed Rank Tests were carried out as post hoc tests to compare differences in facility distribution between programmatic areas.
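To make the test setup concrete: each year is a block (n = 8) and each programmatic area is a treatment (k = 6), with within-year ranks of facility counts feeding the Friedman statistic. The sketch below computes that statistic from first principles as an illustration; the analysis itself was done in SPSS, and the data passed in here would be the actual facility counts, not the toy values in the test.

```python
# Illustrative computation of the Friedman chi-square statistic:
# Q = 12 / (n*k*(k+1)) * sum(R_j^2) - 3*n*(k+1),
# where R_j is the sum of within-block ranks for treatment j.
# A sketch for exposition only; the study's analysis was run in SPSS.

def friedman_statistic(blocks):
    """blocks: list of n blocks (years), each with k measurements (areas)."""
    n, k = len(blocks), len(blocks[0])
    rank_sums = [0.0] * k
    for block in blocks:
        # assign 1-based average ranks within the block, handling ties
        order = sorted(range(k), key=lambda j: block[j])
        ranks = [0.0] * k
        i = 0
        while i < k:
            j = i
            while j + 1 < k and block[order[j + 1]] == block[order[i]]:
                j += 1  # extend the tie group
            avg = (i + j) / 2 + 1
            for m in range(i, j + 1):
                ranks[order[m]] = avg
            i = j + 1
        for j in range(k):
            rank_sums[j] += ranks[j]
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) \
        - 3 * n * (k + 1)
```

In practice, `scipy.stats.friedmanchisquare` and `scipy.stats.wilcoxon` provide the omnibus and post hoc tests, respectively, including p-values.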
Below, we report on findings from the iterative data cleaning exercise and the resulting clean data set used to answer the evaluation question on performance of health care facilities at meeting the mandated HIV-indicator reporting requirements. The results further illustrate the value of the data cleaning exercise.