Data were collected from 362 participants. The EDC database contained a duplicated unique identifier (CODE) in two cases; accordingly, four participants were excluded from both databases, resulting in 358 (98.9%) participants with paired data collected by both systems. Each dataset contained 56 variables (supplementary table 1). A total of 4 (7.1%) variables were excluded from this analysis: “CODE”, as this was the linking variable between datasets; “Study Name” (STUDY), since this was autocompleted in both systems; and “Patient Initials” (INI) and “Place of Birth” (POB), as these were entered as free text and had to be translated from Devanagari to Roman letters. Of the 52 (92.9%) included variables, 3 (5.8%) contained dates, 2 (3.8%) recorded a specific time (in 24-hour format), 10 (19.2%) contained continuous data, 35 (67.3%) contained categorical data and 2 (3.8%) contained text for which it was assumed that data collectors would know the correct spelling of all possible answers in Roman letters (“Diagnosis - DIAG” and “Main Place of Residence - MPR”, supplementary table 1).
Discrepancies between the two datasets were found in 12.6% (2,352/18,616) of all entries, with differences between databases detected in 18.0% (643/3,580) of continuous variables, 15.8% (113/716) of time variables, 13.0% (140/1,074) of date variables, 12.0% (86/716) of text variables, and 10.9% (1,370/12,530) of categorical variables (Table 1 and Fig. 1).
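The denominators above follow from the 358 paired participants multiplied by the number of variables of each type. A minimal Python sketch of that arithmetic, using only the counts reported in this section (the dictionary layout and function name are illustrative and not the authors' analysis code):

```python
N_PARTICIPANTS = 358  # participants with paired data

# variable type -> (number of variables, discordant entries), as reported above
reported = {
    "continuous": (10, 643),
    "time": (2, 113),
    "date": (3, 140),
    "text": (2, 86),
    "categorical": (35, 1370),
}

def discrepancy_table(counts, n_participants):
    """Return {type: (discordant, total entries, percent discordant)}."""
    table = {}
    for var_type, (n_vars, discordant) in counts.items():
        total = n_vars * n_participants
        table[var_type] = (discordant, total, 100 * discordant / total)
    return table

table = discrepancy_table(reported, N_PARTICIPANTS)
total_discordant = sum(row[0] for row in table.values())
total_entries = sum(row[1] for row in table.values())
overall_pct = 100 * total_discordant / total_entries

for var_type, (d, n, pct) in table.items():
    print(f"{var_type}: {d}/{n} = {pct:.1f}%")
print(f"overall: {total_discordant}/{total_entries} = {overall_pct:.1f}%")
```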
A total of 64% (1,499/2,352) of all discordant entries were due to data being entered in one system but not the other. The largest proportion of omissions was among categorical variables (76.6%, 1,148/1,499), followed by continuous variables (17.9%, 269/1,499), dates (4.9%, 74/1,499, all in the EDC) and text variables (0.5%, 8/1,499). Overall, 66.8% (1,002/1,499) of omissions occurred in the PBDC/Epidata database compared with 33.2% (497/1,499) in the EDC database (p<0.001). Significantly higher proportions of omissions were found in the PBDC database among categorical and continuous variables, whereas date and text variables had significantly higher proportions of missing entries in the EDC database (all p<0.05) (Table 1).
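The section does not name the test behind the comparison of omissions between the two systems. As one plausible illustration, the sketch below applies a Pearson chi-square goodness-of-fit test against an equal split to the reported counts (1,002 vs 497). The choice of test and the function name are assumptions, so the result only indicates the order of magnitude of the reported p-value.

```python
import math

def chi2_equal_split(count_a, count_b):
    """Pearson chi-square goodness-of-fit test (1 df) against a 50:50 split."""
    expected = (count_a + count_b) / 2
    chi2 = ((count_a - expected) ** 2 + (count_b - expected) ** 2) / expected
    # For 1 degree of freedom the chi-square survival function is erfc(sqrt(x/2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

omitted_pbdc, omitted_edc = 1002, 497  # omission counts reported above
chi2, p_value = chi2_equal_split(omitted_pbdc, omitted_edc)
print(f"chi2 = {chi2:.1f}, p = {p_value:.3g}")
```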
Depending on variable format, the entries “0”, “9” or “99” should have been entered when data were not available, a question was not relevant or a test result was negative. In 42% (624/1,499) of all discordant blanks, the entry was blank in the PBDC database only while a respective entry was found in the EDC database; the opposite was the case in 3% (44/1,499) of records (p<0.01). Among discordant entries, one date entry in the EDC database and one continuous variable in the PBDC database were out of range, and one logical error among discordant date variables and one among discordant continuous variables were found within the EDC database (Table 1).
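The distinction drawn above, between a blank paired with one of the codes “0”, “9” or “99” and a blank paired with a real value, can be sketched as a small classifier. This is an illustrative simplification: it assumes entries are strings, that blanks are empty strings or None, and that the three codes apply to every variable, whereas the text states that the applicable code depends on variable format. The category labels are hypothetical.

```python
NOT_AVAILABLE_CODES = {"0", "9", "99"}  # codes named in the text above

def classify_pair(pbdc_value, edc_value):
    """Classify one paired entry (string values; blank = "" or None)."""
    pbdc = (pbdc_value or "").strip()
    edc = (edc_value or "").strip()
    if pbdc == edc:
        return "concordant"
    if pbdc == "" or edc == "":
        # Exactly one side is blank: an omission in that system.
        blank_side = "pbdc" if pbdc == "" else "edc"
        filled = edc if pbdc == "" else pbdc
        kind = "code" if filled in NOT_AVAILABLE_CODES else "value"
        return f"blank_in_{blank_side}_vs_{kind}"
    return "value_mismatch"

print(classify_pair("", "99"))      # blank in PBDC, code in EDC
print(classify_pair("37.2", None))  # blank in EDC, value in PBDC
```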
Among the 10 continuous variables, the median relative difference between entries ranged from 1.0% (interquartile range (IQR): 0.39–1.00, range: 0.20–1.94) for measured body temperature (TEMP) to 55.1% (IQR: 34.09–76.46, range: 2.44–95.33) for malaria parasite count per white blood cells (APC).
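The text does not define how the relative difference was computed. The sketch below assumes one common definition, the absolute difference divided by the larger of the two magnitudes, and restricts the summary to discordant pairs. Both choices are assumptions, and the example values are made up purely for illustration.

```python
import statistics

def relative_difference(a, b):
    """Percent difference relative to the larger magnitude (assumed definition)."""
    largest = max(abs(a), abs(b))
    if largest == 0:
        return 0.0
    return 100 * abs(a - b) / largest

def summarize(pairs):
    """Median, quartiles and range of relative differences among discordant pairs."""
    diffs = [relative_difference(a, b) for a, b in pairs if a != b]
    q1, _, q3 = statistics.quantiles(diffs, n=4, method="inclusive")
    return {
        "median": statistics.median(diffs),
        "iqr": (q1, q3),
        "range": (min(diffs), max(diffs)),
    }

# Made-up (PBDC, EDC) temperature pairs, not study data
example = [(37.0, 37.4), (38.5, 38.5), (36.6, 36.9), (39.0, 38.0), (37.2, 37.9)]
summary = summarize(example)
print(summary)
```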
A total of 33 variables were collected with the patient present, with discordant entries in 5.8% (685/11,814), significantly fewer (p<0.001) than among the 19 variables collected in the laboratory, where 24.5% (1,667/6,802) of all entries differed between the two systems (Fig. 2). The observed difference held across all relevant categories (all p<0.05), with 3.2% (374/11,814) blanks generated for the patient data and 16.6% (1,130/6,802) for the laboratory data (p<0.001).
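The comparison of discordance between patient-side and laboratory variables can be illustrated with a test of two proportions. The sketch below uses a Pearson chi-square test on the 2x2 table built from the reported counts. The section does not state which test was used, so the choice of test and the function name are assumptions.

```python
import math

def chi2_two_proportions(x1, n1, x2, n2):
    """Pearson chi-square test (1 df, no continuity correction) on a 2x2 table."""
    table = [[x1, n1 - x1], [x2, n2 - x2]]
    row_totals = [n1, n2]
    col_totals = [x1 + x2, (n1 - x1) + (n2 - x2)]
    grand_total = n1 + n2
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (table[i][j] - expected) ** 2 / expected
    # For 1 degree of freedom the chi-square survival function is erfc(sqrt(x/2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Discordant entries / total entries: patient-side vs laboratory variables
chi2, p_value = chi2_two_proportions(685, 11814, 1667, 6802)
print(f"patient: {100 * 685 / 11814:.1f}%, laboratory: {100 * 1667 / 6802:.1f}%")
print(f"chi2 = {chi2:.0f}, p < 0.001: {p_value < 0.001}")
```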