Data cleaning and data quality assessment approaches
Data cleaning is defined as “the process used to determine inaccurate, incomplete, or unreasonable data and then improving the quality through correction of detected errors and omissions” [28]. Data cleaning is essential for transforming raw data into quality data for purposes such as analyses and data mining [29]. It is also an integral step in the knowledge discovery in databases (KDD) process [30].
Various issues exist within data that necessitate cleaning in order to improve its quality [31–33]. An extensive body of work exists on how to clean data, with approaches broadly classified as quantitative or qualitative. Quantitative approaches employ statistical methods and are largely used to detect outliers [34–36]. Qualitative techniques, on the other hand, use patterns, constraints, and rules to detect errors [37]. These approaches can be applied within automated data cleaning tools such as ARKTOS, AJAX, FraQL, Potter’s Wheel and IntelliClean [33,37,38].
In addition, a number of frameworks are used to assess data quality in health information systems, which can be utilized by countries with DHIS2. The Data Quality Review (DQR) tool developed in collaboration with WHO, Global Fund, Gavi and USAID/MEASURE Evaluation provides a standardized approach that aims to facilitate regular data quality checks [39]. Other tools for routine data quality assessments include the MEASURE Evaluation Routine Data Quality Assessment Tool (RDQA) [40] and the WHO/IVB Immunization Data Quality Self-Assessment (DQS) [41].
Some of the data quality categories (intrinsic, contextual, representational and accessibility) [23] have been used both in cleaning approaches and in the data quality frameworks developed. A closer examination of the aforementioned approaches reveals a focus on assessing intrinsic data quality aspects, which can be categorized further into syntactic quality (conformance to database rules) and semantic quality (correspondence or mapping to external phenomena) [42].
Moreover, while tools and approaches exist for both data quality assessment and data cleaning, concerted efforts have been placed on assessment of health information system data quality [39,40], as opposed to cleaning approaches for secondary analyses, which are largely dependent on the context of data use [24]. Wang and Strong posited the need to consider data quality with respect to the context of the tasks, which can be a challenge as tasks and contexts vary by user needs [23]. Therefore, specifying the task and the features relevant to it can be employed for contextual data quality [23,43].
With this in mind, and to the best of our knowledge, no standard consensus-based approach exists to ensure that replicable and rigorous data cleaning approaches and documentation are applied to extracted DHIS2 data intended for secondary analyses. As such, ad hoc data cleaning approaches have been employed on extracted data prior to analyses [16–18]. Moreover, whereas some studies provide brief documentation of the data cleaning procedures used [19], others lack documentation altogether, leaving the data cleaning approaches undisclosed and behind the scenes [20]. Failure to disclose the approaches used makes it difficult to replicate data cleaning procedures and to ensure that all types of anomalies are systematically addressed prior to using the data for analysis and decision-making. Furthermore, the approach used in data extraction and cleaning affects the analysis results [21].
Oftentimes, specific approaches are applied based on the data set and the aims of the cleaning exercise [10,44,45]. Dziadkowiec et al. used Khan’s framework to clean data extracted from the relational database of an Electronic Health Record (EHR) system [10]. In their approach, intrinsic data quality was, in our view, considered in data cleaning with a focus on syntactic quality issues (such as conforming to integrity rules). Miao et al. proposed a data cleaning framework for activities that involve secondary analysis of EHR data [45], which in our view considered intrinsic data quality with a focus on semantic quality (such as completeness and accuracy). Savik et al. approached data cleaning, in our view, from a contextual perspective, which entailed preparing a dataset appropriate for the intended analysis [44].
In this study, we approach data cleaning from a contextual perspective, whereby only data fit for subsequent analyses are retained. Given our data set, our study’s data cleaning approach was informed by the conceptual data-cleaning framework proposed by Van den Broeck et al. [21]. This framework was used because it provides a deliberate and systematic data cleaning guideline that is amenable to being tailored towards cleaning data extracted from DHIS2. It presents data cleaning as a three-phase process involving repeated cycles of data screening, data diagnosis, and data editing of suspected data abnormalities. The screening process involves identification of lacking or excess data, outliers, inconsistencies and strange patterns [21]. Diagnosis involves determining whether suspected data points are errors, missing data, true extremes or true normal values [21]. Editing involves correcting or deleting any identified errors [21]. The phases in Van den Broeck et al.’s framework have also been applied in various settings [46,47]. Human-driven approaches complemented by automated approaches were used in the various data cleaning phases in this study. Human involvement in data cleaning has also been advocated in other studies [35].
Study Setting
This study was conducted in Kenya, a country in East Africa. Kenya adopted DHIS2 for its national reporting in 2011 [7]. The country has 47 administrative counties, and all the counties report a range of healthcare indicator data from care facilities and settings into the DHIS2 system. For the purposes of this study, we focused specifically on HIV-indicator data reported within Kenya’s DHIS2 system, given that these are the most comprehensively reported indicators in the system.
Kenya’s DHIS2 has enabled various quality mechanisms for HIV data, including data validation rules, outlier analysis, and minimum and maximum ranges, which are implemented at the point of data entry. The DHIS2 data quality tool, an application included in DHIS2 to supplement these in-built mechanisms [12], was nonetheless not actively in use during our study period (2011-2018). Both the in-built quality mechanisms and the DHIS2 data quality tool consider intrinsic data quality aspects.
Data cleaning process
Adapting Van den Broeck et al.’s framework, a step-by-step approach was used during extraction and cleaning of the data from DHIS2. These steps are generic and can be replicated by others conducting robust cleaning of DHIS2 data for analyses. The steps are outlined below:
i. Step 1 - Outline the analyses or evaluation questions: Prior to applying Van den Broeck et al.’s conceptual framework, it is important to identify the exact evaluations or analyses to be conducted, as this helps define the data cleaning exercise.
ii. Step 2 - Description of data and study variables: This step is important for defining the needed data elements that will be used for the evaluation data set.
iii. Step 3 - Create the data set: This step involves identifying the data needed and extracting it from the relevant databases to generate the final data set. Oftentimes, creating this data set might require combining data from different sources.
iv. Step 4 - Apply the framework for data cleaning: During this step, the three data cleaning phases (screening, diagnosis, and treatment) in Van den Broeck et al.’s framework are applied to the data set created.
v. Step 5 - Analyze the data: This step provides a summary of the data quality issues discovered, the eliminated data after the treatment exercise, and the retained final data set on which analyses can then be done.
Application of data cleaning process: Kenya HIV-indicator reporting case example
In this section, we present the application of the data cleaning sequence above using Kenya as a case example. It is worth noting that in this study the terms ‘programmatic area report’ and ‘report’ are used interchangeably, since each report represents a programmatic area and contains a number of indicators.
Step 1: Outline the analyses or evaluation questions and goals
For this reference case, DHIS2 data had to undergo the data cleaning process prior to use of the data for an evaluation question on ‘Performance of health facilities at meeting the completeness and timeliness facility reporting requirements by the Kenyan Ministry of Health (MoH)’. The goal was to identify the best- and poorest-performing health facilities at reporting within the country, based on completeness and timeliness in submitting their reports into DHIS2.
This study only attempts to clean the data for subsequent analyses. The actual analyses and evaluation will thus be conducted using the final clean data set in a separate study.
Step 2: Description of data and study variables
HIV-indicator data in Kenya are reported into DHIS2 on a monthly basis by facilities offering HIV services using the MOH-mandated form called “MOH 731- Comprehensive HIV/AIDS Facility Reporting Form” (MOH731). From 2011 to 2018, MOH 731 consisted of six programmatic areas representing six independent reports containing HIV-indicators to be reported [see Additional file 1]. The six reports and the number of indicators reported in each include: (i) HIV Counselling and Testing (HCT) – 14 indicators; (ii) Prevention of Mother-to-Child Transmission (PMTCT) – 40 indicators; (iii) Care and Treatment (CrT) – 65 indicators; (iv) Voluntary Medical Male Circumcision (VMMC) – 13 indicators; (v) Post-Exposure Prophylaxis (PEP) – 14 indicators; and (vi) Blood Safety (BS) – 3 indicators.
Each facility offering HIV services is expected to submit reports with indicators every month based on the type(s) of services offered by that facility. The monthly due date for all reports, as well as the expected number of reports per facility, is defined by the MoH.
For our use case, we wanted to create a data set for secondary analyses to determine the performance of facilities at meeting the MoH reporting requirements (facility reporting completeness and timeliness of reporting). Hence, we retained only facilities offering services for any of the six programmatic areas. Completeness in reporting by facilities within Kenya’s DHIS2 is measured as a continuous variable ranging from 0% to 100% and is identified within the system by a variable called ‘Reporting Rate (RR)’. The percentage RR is calculated automatically within DHIS2 as the actual number of reports submitted by each facility divided by the expected number of reports from the facility, multiplied by 100 (Percentage RR = actual number of submitted reports / expected number of reports * 100). Given that MOH731 reports should be submitted by facilities on a monthly routine, the expected number of monthly reports per programmatic area per year is 12 (one report expected per month). It should be noted that this Reporting Rate calculation only considers report submission and not the content within the reports. Given that facilities offering any of the HIV services are required to submit the full MOH731 form containing all six programmatic area reports, zero (0) cases are reported for indicators where services are not provided, which appear as blank reports in DHIS2. As such, a report may be submitted blank or have missing indicators but will still be counted as complete (facility reporting completeness) simply because it was submitted. Timeliness is calculated based on whether reports were submitted by the 15th day of the reporting month, as set by the MoH. Timeliness is represented in DHIS2 as ‘Reporting Rate on Time (RRT)’ and is also calculated automatically.
The percentage RRT for a facility is calculated as the actual number of reports submitted on time by the facility divided by the expected number of reports, multiplied by 100 (Percentage RRT = actual number of reports submitted on time / expected number of reports * 100). Annual reports consisting of percentage Reporting Rate and Reporting Rate on Time were therefore generated from DHIS2 and extracted per facility, per year.
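The RR and RRT formulas above can be sketched as two small functions. This is a minimal illustration of the calculations as described; the function names are ours, not DHIS2 internals:

```python
def reporting_rate(submitted: int, expected: int) -> float:
    """Percentage RR = actual submitted reports / expected reports * 100."""
    return submitted / expected * 100


def reporting_rate_on_time(on_time: int, expected: int) -> float:
    """Percentage RRT = reports submitted on time / expected reports * 100."""
    return on_time / expected * 100


# A facility expected to submit 12 monthly reports in a year:
rr = reporting_rate(submitted=10, expected=12)        # ~83.3%
rrt = reporting_rate_on_time(on_time=8, expected=12)  # ~66.7%
```

Note that, as described above, these rates measure only report submission, not the completeness of indicator content within each report.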
Step 3: Create the data set
After obtaining Institutional Review and Ethics Committee (IREC) approval for this work, we set out to create our database from three data sources as outlined below:
(1) Data Extracted from DHIS2: Two sets of data were extracted from DHIS2 to Microsoft Office Excel (version 2016). For the first data set, we extracted variables from DHIS2 for all HIV programmatic area reports submitted from all health facilities in all 47 counties in Kenya between the years 2011 and 2018, with variables grouped by year. Variables extracted from DHIS2 by year included: facility name, programmatic area report (e.g. Blood Safety), expected number of reports, actual number of submitted reports, actual number of reports submitted on time, cumulative Reporting Rate by year (calculated automatically in DHIS2) and cumulative Reporting Rate on Time by year (calculated automatically in DHIS2) [see Additional file 2]. The extracted data for Reporting Rate and Reporting Rate on Time constituted the annual reports in the six programmatic areas for the years 2011 to 2018 for the respective health facilities.
For the second data set, we extracted the HIV-indicator data elements submitted within each annual programmatic area report by the health facilities for all six programmatic areas for every year under evaluation [see Additional file 1]. The annual report contained cumulative HIV-indicator data elements gathered in each programmatic area per facility, per year.
In addition, extracting the aforementioned datasets from 2011 to 2018 resulted in repeated occurrences of the facility variable across the different years. For example, facilities registered in DHIS2 in 2011 appear in subsequent years, resulting in eight occurrences within the 8 years (2011-2018) per programmatic area report (e.g. Blood Safety). This resulted in each row containing the following variables: facility name, year, percentage Reporting Rate, and percentage Reporting Rate on Time for the six programmatic area reports. In this study, the facility data per row is referred to as a ‘facility record’.
(2) Facility Information: We augmented the DHIS2 data with detailed facility information derived from Kenya Master Facility List (KMFL). This information included facility level (II-VI), facility type (such as dispensary, health center, medical clinic) and facility ownership (such as private practice, MoH-owned, owned by a non-governmental organization).
(3) Electronic Medical Record Status: We used the Kenya Health Information Systems (KeHIMS) list, which contains electronic medical records (EMR) implemented in health facilities in Kenya, to incorporate information on whether each facility had an EMR or not. Information from these three sources was merged into a single data set as outlined in Figure 1.
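The merging of the three sources can be sketched as follows. This is a hypothetical illustration in pandas; the column names and sample values are our assumptions, not the study's actual schema:

```python
import pandas as pd

# Source 1: DHIS2 extract (facility-records with RR/RRT per year).
dhis2 = pd.DataFrame({
    "facility": ["Facility A", "Facility B"],
    "year": [2015, 2015],
    "rr_pct": [83.3, 100.0],
    "rrt_pct": [66.7, 91.7],
})

# Source 2: KMFL facility details (level, type, ownership).
kmfl = pd.DataFrame({
    "facility": ["Facility A", "Facility B"],
    "level": ["III", "II"],
    "type": ["health centre", "dispensary"],
    "ownership": ["MoH", "private practice"],
})

# Source 3: EMR list (facilities known to have an EMR).
emr = pd.DataFrame({"facility": ["Facility A"], "has_emr": [True]})

# Left-merge so every DHIS2 facility-record is retained and augmented.
merged = (dhis2
          .merge(kmfl, on="facility", how="left")
          .merge(emr, on="facility", how="left"))
# Facilities absent from the EMR list are marked as not having an EMR.
merged["has_emr"] = merged["has_emr"].fillna(False)
```

Left joins keep the DHIS2 facility-records as the backbone of the data set, so facilities missing from the auxiliary lists are not dropped.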
Step 4: Application of the framework for data cleaning
Figure 2 outlines the iterative cleaning process we applied, adapting Van den Broeck et al.’s framework. Data cleaning involved repeated cycles of screening, diagnosis, and treatment of suspected data abnormalities, with each cycle resulting in a new data set. Details of the data cleaning process are outlined in Figure 2.
a) Screening Phase
During the screening phase, five types of oddities need to be distinguished, namely: lack or excess of data; outliers (data falling outside the expected range); erroneous inliers; strange patterns in distributions; and unexpected analysis results [21].
For determining errors, we used Reporting Rate and Reporting Rate on Time as key evaluation variables. Reporting Rate by itself only gives a sense of the proportion of expected reports submitted but does not evaluate whether the exact HIV-indicator data elements are included within each report. To evaluate completion of HIV-indicator data elements within each of the submitted programmatic area reports, we created a new variable named ‘Cumulative Percent Completion (CPC)’. Using the annual report extracted for HIV-indicator data elements per facility, Cumulative Percent Completion was calculated by counting the number of non-blank values and dividing this by the total number of indicators for each programmatic area. As such, if a facility reported 10 out of 40 indicators in an annual report, it would have 25 percent completion. Cumulative Percent Completion therefore provides an aggregate annual summary of the proportion of expected indicator values that are completed within submitted reports. The results for Cumulative Percent Completion were then included as variables in the facility-records described in step 3, section 1. This resulted in a facility-record containing the following variables per row: facility name, year, percentage Reporting Rate, percentage Reporting Rate on Time and Cumulative Percent Completion for the six programmatic areas.
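The Cumulative Percent Completion calculation above can be sketched as follows (a minimal illustration; the function name is ours, and `values` stands for one facility's annual indicator values for a programmatic area, with `None` or an empty string representing a blank value):

```python
def cumulative_percent_completion(values) -> float:
    """CPC = non-blank indicator values / total expected indicators * 100."""
    non_blank = sum(1 for v in values if v is not None and v != "")
    return non_blank / len(values) * 100


# PMTCT has 40 indicators; a facility reporting 10 of them scores 25%.
pmtct_values = [1] * 10 + [None] * 30
cpc = cumulative_percent_completion(pmtct_values)  # 25.0
```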
b) Diagnostic Phase
The diagnostic phase enables clarification of the true nature of worrisome data points, patterns, and statistics. Van den Broeck et al. posit possible diagnoses for each data point as: erroneous, true extreme, true normal or idiopathic (no diagnosis found, but data still suspected of having errors) [21]. We used the combination of Reporting Rate, Reporting Rate on Time and Cumulative Percent Completion to categorize the various types of situations (errors or no errors) for each facility per annual report, for every year a facility reported into DHIS2 (Table 1). In this table, “0” represents a situation where the percentage is zero; “X” represents a situation where the percentage is above zero; and “>100%” represents a situation where the percentage is more than 100. Data points above 100% were considered erroneous records, as a percentage reporting rate beyond 100 is not logically possible. Based on the values for each of the three variables, it was possible to diagnose the various issues within DHIS2 (Diagnosis column).
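The value coding used in Table 1 can be sketched as follows. This is a hedged illustration of the coding scheme only (the exact mapping of code combinations to situations lives in Table 1 and is not reproduced here); the function names are ours:

```python
def code_value(pct: float) -> str:
    """Code a percentage as '0' (zero), 'X' (above zero) or '>100%'."""
    if pct > 100:
        return ">100%"
    return "X" if pct > 0 else "0"


def is_erroneous(cpc: float, rr: float, rrt: float) -> bool:
    """A record is erroneous if any variable exceeds 100%,
    which is not logically possible for a reporting rate."""
    return ">100%" in {code_value(cpc), code_value(rr), code_value(rrt)}


codes = [code_value(v) for v in (0, 83.3, 120)]  # ['0', 'X', '>100%']
```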
For each programmatic area report (e.g. Blood Safety), we categorized facilities by year and variables. All health facilities with an average Cumulative Percent Completion, Reporting Rate, and Reporting Rate on Time of zero (0) across all reports were identified as not having reported for the year and were excluded, as demonstrated by the examples of Facilities A and B in Table 2.
Beyond categorization of the various situations by report type, facility and year as defined above, errors related to duplicates were also identified using two scenarios. In the first scenario, health facilities had similar attributes such as year, name and county, but different data for Reporting Rate and Reporting Rate on Time. In the second scenario, health facilities had similar attributes such as year, name and county, with identical data for Reporting Rate and Reporting Rate on Time.
c) Treatment Phase
This is the final stage after screening and diagnosis, and entails deciding how to act on the problematic records identified. Van den Broeck et al. limit the possible actions to correcting, deleting or leaving unchanged [21]. Based on the diagnoses illustrated in Table 1, facility-records in situations A-F were deleted and hence excluded from the study. Duplicates identified in the two scenarios above were also excluded: where health facilities had similar attributes (year, name, and county) but different data for Reporting Rate and Reporting Rate on Time, all entries were deleted; where the attributes and the data for Reporting Rate and Reporting Rate on Time were identical, only one entry was deleted. Only reports in situations G and H were considered ideal for the final clean data set.
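The two duplicate-treatment rules above can be sketched in pandas. This is an illustrative sketch under assumed column names and made-up values: conflicting duplicates (same facility, year and county but different RR/RRT) are dropped entirely, while exact duplicates keep one copy:

```python
import pandas as pd

df = pd.DataFrame({
    "facility": ["A", "A", "B", "B"],
    "year":     [2015, 2015, 2015, 2015],
    "county":   ["Nairobi"] * 4,
    "rr_pct":   [80.0, 90.0, 75.0, 75.0],  # A: conflicting, B: exact copies
    "rrt_pct":  [70.0, 60.0, 50.0, 50.0],
})

keys = ["facility", "year", "county"]

# Scenario 2 (identical attributes AND identical RR/RRT): keep one entry.
df = df.drop_duplicates(subset=keys + ["rr_pct", "rrt_pct"], keep="first")

# Scenario 1 (identical attributes but conflicting RR/RRT): the keys are
# still duplicated after the step above, so all such entries are dropped.
df = df[~df.duplicated(subset=keys, keep=False)]
```

In this example, both rows for Facility A are deleted (conflicting data), while Facility B retains a single row.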
Step 5: Data analysis
The facility-records were then disaggregated into six individual data sets, one per programmatic area, containing the following attributes: facility name, year, Cumulative Percent Completion, percentage Reporting Rate and percentage Reporting Rate on Time, as well as the augmented data on facility information and EMR status. The disaggregation was done because facilities offer different services and do not necessarily report indicators for all programmatic areas. SPSS was used to analyze the data using frequency distributions and cross tabulations in order to screen for duplication and outliers. Individual health facilities with more than eight annual reports for a specific programmatic area were identified as duplicates, since the maximum number of annual reports per programmatic area for an individual facility is eight, given that data were extracted over an eight-year period. From the cross tabulations, percentage Reporting Rate and percentage Reporting Rate on Time values above 100% were identified as erroneous records.
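Although the study used SPSS frequency distributions for this screen, the same check can be sketched in Python: within one programmatic area, a facility should have at most eight annual records (2011-2018), so any facility with more is flagged as duplicated. Column names and values are assumptions for the sketch:

```python
import pandas as pd

# Facility A has a duplicated 2015 record (9 rows); Facility B has the
# expected 8 annual records for 2011-2018.
records = pd.DataFrame({
    "facility": ["A"] * 9 + ["B"] * 8,
    "year": list(range(2011, 2019)) + [2015] + list(range(2011, 2019)),
})

# Count annual records per facility for this programmatic area.
counts = records.groupby("facility").size()

# Flag facilities exceeding the eight-year maximum as duplicates.
flagged = counts[counts > 8].index.tolist()
```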
After the multiple iterations of data cleaning as per Figure 2, where erroneous data were removed by situation type (identified in Table 1), a final clean data set was available and brought forward to be used in a separate study for subsequent secondary analyses (which include answering the evaluation question in step 1). At the end of the data cleaning exercise, we determined the percentage distribution of the various situation types that resulted in the final data set. The percentages were calculated by dividing the number of facility-records in each situation type by the total facility-records in each programmatic area, multiplied by 100. As such, only data sets disaggregated into the six programmatic areas were included in the analysis. Using this analysis and the descriptions from Table 1, we selected situations B and D to determine whether there was a difference in the distribution of facility records containing the selected situation types in the six programmatic areas across the 8 years (2011-2018).
This enabled comparison of the distribution of facility records by programmatic area, categorized by situations B and D. The data comprise related samples and are not normally distributed; therefore, a Friedman analysis of variance (ANOVA) was conducted to examine whether there was a difference in the distribution of facility reports by programmatic area across all years, N=8 (2011 to 2018), for the selected situation types. The variables analyzed were year, situation type and programmatic area, and the unit of analysis was the number of records in each situation type for a programmatic area. The distribution of facility-records was measured in all six programmatic areas across the eight years, categorized by situation type. Wilcoxon signed-rank tests were carried out as post hoc tests to compare differences in facility-report distribution between the programmatic areas.
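The Friedman test with Wilcoxon post hoc comparisons described above can be sketched with SciPy. The study used SPSS; this is an equivalent illustration with made-up yearly counts and, for brevity, only three of the six programmatic areas (each list holds the number of facility-records in one situation type for each of the N=8 years):

```python
from scipy.stats import friedmanchisquare, wilcoxon

# Hypothetical yearly counts (2011-2018) of facility-records in one
# situation type, per programmatic area; values are invented for the sketch.
hct   = [120, 130, 125, 140, 150, 160, 155, 165]
pmtct = [110, 115, 118, 120, 125, 130, 128, 135]
crt   = [ 90,  95,  92, 100, 105, 110, 108, 112]

# Friedman ANOVA: do the related (year-matched) distributions differ
# across programmatic areas?
stat, p = friedmanchisquare(hct, pmtct, crt)

# Post hoc: Wilcoxon signed-rank test for one pair of programmatic areas.
w_stat, w_p = wilcoxon(hct, pmtct)
```

In practice, post hoc pairwise comparisons across all area pairs would typically use a multiplicity correction (e.g. Bonferroni), which the sketch omits.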
Below, we report on findings from the iterative data cleaning exercise and the resulting clean data set. The results further illustrate the value of the data cleaning exercise.