Record linkage without patient identifiers: proof of concept using data from South Africa’s national HIV program

Background: Linkage between health databases typically requires identifiers such as patient names and personal identification numbers. We developed and validated a record linkage strategy to combine administrative health databases without the use of patient identifiers, with application to South Africa’s public sector HIV treatment program.

Methods: We linked CD4 counts and HIV viral loads from South Africa’s HIV clinical monitoring database (TIER.Net) and the National Health Laboratory Service (NHLS) for patients receiving care between 2015 and 2019 in Ekurhuleni District (Gauteng Province). We used a combination of variables related to lab results contained in both databases (result value; specimen collection date; facility of collection; patient year and month of birth; and sex). Exact matching linked records that agreed exactly on all linking variables, while caliper matching relaxed the test date requirement to an approximate match (± 5 days). We then developed a sequential linkage approach utilising specimen barcode matching, then exact matching, and lastly caliper matching. Performance measures were sensitivity and positive predictive value (PPV); share of patients linked across databases; and percent increase in data points for each linkage approach.

Results: We attempted to link 2,017,290 lab results from TIER.Net (representing 523,558 unique patients) and 2,414,059 lab results from the NHLS database. Linkage performance was evaluated using specimen barcodes (available for a minority of records in TIER.Net) as a “gold standard”. Exact matching achieved a sensitivity of 69.0% and PPV of 95.1%. Caliper matching achieved a sensitivity of 75.7% and PPV of 94.5%. In sequential linkage, 41.9% of the matched TIER.Net labs were matched by specimen barcodes, 51.3% by exact matching, and 6.8% by caliper matching, for a total of 71.9% of TIER.Net labs matched, with PPV = 96.8% and sensitivity = 85.9%. The sequential approach linked 86.0% of TIER.Net patients with at least one lab result to the NHLS database (N=1,450,087). Linkage to the NHLS Cohort increased the number of laboratory results associated with TIER.Net patients by 62.6%.

Conclusions: Linkage of TIER.Net and NHLS without patient identifiers attained high accuracy and yield without compromising patient privacy. The integrated cohort provides a more complete view of patients’ lab history and could yield more accurate estimates of HIV program indicators.

South Africa's HIV treatment program is the largest in the world, with about 5.2 million adult patients on antiretroviral therapy (ART) in 2019 (1). Although mass provision of ART has reduced HIV-associated morbidity and mortality (2)(3)(4)(5)(6), HIV remains the fifth leading cause of death in South Africa (7) with over 200,000 new infections annually (8).
The NHLS Cohort is the primary laboratory database, including all data generated by public-sector medical laboratories, while TIER.Net contains data generated by clinical events at all HIV management centres in South Africa, including ART initiation, ART pick-up dates, regimen type, and clinic visits. Unlike TIER.Net, the NHLS data are nationally deduplicated, enabling longitudinal analyses even when patients re-enter care at other facilities (15,24). Both national in scope, TIER.Net and NHLS contain complementary data that clinicians draw on for patient care. However, the two databases are not currently integrated, and no consistent patient identifier exists to enable patient-level longitudinal analyses using information from both data sources.
Linkage of health databases traditionally requires primary personal identifiers such as names, national identification numbers, addresses, and phone numbers. However, with heightened concern for data privacy in South Africa and elsewhere, access to patient-identifying information is increasingly restricted to clinical management purposes (25,26). Validated techniques are therefore required to link databases without primary patient identifiers (27)(28)(29) to enable program monitoring, evaluation, and research. A variety of privacy-preserving record linkage (PPRL) approaches have been proposed, with most involving the encryption of primary identifiers behind data owners' firewalls and linkage of those encoded data (30). Here, we explore the feasibility of linkage without primary patient identifiers at all, relying instead on laboratory event information recorded in both databases. Our paper builds on prior efforts by Edward Nicol et al. (31) and Ingrid Basset et al. (32) to link the NHLS with HIV patient management systems for specific clinical cohorts in South Africa.
In this paper, we set out to develop and validate a strategy to link TIER.Net and the NHLS Cohort without patient identifiers. As a proof of concept, we used data from Ekurhuleni District, a large, mostly urban district in Gauteng Province, South Africa. We developed multiple linkage approaches, validated their performance against "gold standard" data, and quantified the benefits of linkage with respect to the completeness of the resulting database. Our goal was to create an integrated HIV cohort with comprehensive clinical and laboratory data that would enable longitudinal analyses of the full HIV care cascade not possible with NHLS or TIER.Net data alone.

Data and study population
The study population was all patients receiving HIV care in Ekurhuleni District from 1 January 2015 to 31 December 2019 at 102 public-sector health facilities, with at least one CD4 count or HIV viral load during this period. We compiled data on this study population from two sources: TIER.Net and NHLS.
Three interlinked Electronic Registers (TIER.Net)
TIER.Net is South Africa's facility-based electronic patient health data management system. Established in 2010 and scaled up in the following years, TIER.Net serves as the primary monitoring platform for the national HIV care and treatment program (9).
Data from patient charts are captured into TIER.Net by clinic staff. TIER.Net contains data on clinical events including ART initiation, ART pick-up dates, regimen type, and clinic visits. Although laboratory test information (CD4 counts and HIV viral loads) is also captured, the process is manual and inconsistent, resulting in incomplete laboratory data (13). TIER.Net is not nationally networked (9), and lab results preceding HIV diagnosis and ART initiation are largely unavailable in TIER.Net (33,34). The TIER.Net patient ID is allocated by facilities, and patients who seek care at alternative facilities may receive a new TIER.Net patient ID, creating duplicate records and hindering tracking of patients across facilities (13).

National Health Laboratory Service (NHLS) National HIV Cohort
The NHLS provides all laboratory and pathology services for the country's public sector HIV care and treatment program (35). The NHLS maintains a centralised database of all laboratory test data (including CD4 count and HIV viral load (VL) data), with results logged to the NHLS Corporate Data Warehouse (CDW). The NHLS's CDW previously developed a linkage algorithm. More recently, a team at NHLS, University of Witwatersrand, and Boston University developed, implemented, and validated an improved record-linkage algorithm with much higher sensitivity, enabling analysis of the NHLS database as a national cohort covering all lab-monitored patients in South Africa's public sector HIV program (14,15). The NHLS National HIV Cohort has been used to track trends in CD4 counts at presentation, assess retention in care regardless of patient transfer, quantify treatment outcomes for different groups, and evaluate the impact of HIV policy changes (15)(16)(17)(18)(19)(20)(21)(22).

Variables used for linkage
Demographic and laboratory test variables
De-identified data were extracted from the TIER.Net and NHLS databases. We extracted laboratory event-specific demographic data [year of birth (YOB), month of birth (MOB), and sex of the patient], geographic data (province name, district name, subdistrict name, and health facility name), and details on all CD4 counts and HIV viral loads (result value, test date, and test type) taken between January 1, 2015 and December 31, 2019. The details of each of these linking variables are provided in Box 1. All variables were harmonised between the databases to ensure equivalent formatting. We linked health facilities starting with a crosswalk provided by the National Institute for Communicable Diseases (NICD) at NHLS. We then manually reviewed facility names within Ekurhuleni District to ensure correspondence. All health facility names were standardised to be the same across the two databases. We also retained de-identified unique patient IDs from each database. From TIER.Net, we extracted the "TIER.Net ID". From NHLS, we extracted a unique patient ID created by NHLS CDW (henceforth, "NHLS CDW ID") as well as the National HIV Cohort ID (henceforth, "NHLS Cohort ID"). At the time of writing, the NHLS Cohort ID was available only through March 2018.
Specimen barcodes as a gold standard matching variable
Some TIER.Net laboratory results were recorded with their NHLS specimen barcode. These alphanumeric barcodes are allocated centrally by NHLS and are not duplicated within or across health facilities. Barcodes are affixed to biological specimens (e.g. blood test tubes), the corresponding NHLS test request form, and the facility's specimen register, and are provided with the test results sent back to facilities from NHLS. The barcode is the same across all the tests performed on the same biological specimen. Except for cases where the same test was repeated on the same specimen, the combination of barcode and test type is unique. Barcodes are available for nearly all laboratory results in the NHLS database. Although the barcode data are highly incomplete in TIER.Net and cannot be used as the only linkage strategy, the combination of barcode and test type offers a highly accurate "gold standard" for validation of other approaches.
To confirm the suitability of using barcodes as a gold standard, we assessed the concordance of other test information (YOB, MOB, sex, test date and value, and facility name) when barcodes matched. We first excluded labs where the specimen barcode and test type were not unique (0.013% in TIER.Net and 0.005% in NHLS). We then identified lab results with the same barcode and test type in the two databases and quantified the % discordance in the associated test information. To assess the probability of barcode matching by chance, we randomly selected 100,000 lab results from TIER.Net and linked them to 100,000 randomly sampled lab results from NHLS (Table S1). The expectation was that a very small share of these randomly selected pairs would be true matches. We quantified the proportion of discordance in the randomly matched pairs. By comparing the share of barcode matches that were fully discordant in other characteristics with the share that would be expected to be fully discordant by random chance, we were able to estimate the false positivity rate in the barcode matching, under the assumption that all fully discordant records were different people. This is an upper bound on the false positivity rate, given that some of the fully discordant barcode matches may actually have been true matches with substantial typographic error.
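The comparison described above can be read as a simple ratio; the sketch below follows that reading. It is an illustration under stated assumptions, not the paper's stated formula: it assumes every fully discordant barcode match is a false match and that false matches behave like random pairs. The function names and the example shares are hypothetical (the observed values are in Table S1).

```python
def fully_discordant_share(pairs, fields):
    """Share of record pairs that disagree on every field in `fields`."""
    discordant = sum(all(a[f] != b[f] for f in fields) for a, b in pairs)
    return discordant / len(pairs)


def false_positive_upper_bound(share_barcode, share_random):
    """Upper bound on the share of barcode matches that are false matches.

    Assumption of this sketch: false matches are fully discordant as often
    as random pairs are, and true matches are never fully discordant.
    """
    return min(1.0, share_barcode / share_random)


# Hypothetical toy pairs: the first disagrees on both fields, the second on one.
pairs = [
    ({"sex": "F", "yob": 1980}, {"sex": "M", "yob": 1981}),
    ({"sex": "F", "yob": 1980}, {"sex": "F", "yob": 1981}),
]
share_random = fully_discordant_share(pairs, ["sex", "yob"])
bound = false_positive_upper_bound(0.001, share_random)  # 0.001 is made up
```

Dividing by the random-pair share scales the observed fully discordant matches up to account for false matches that happen to agree on some field by chance.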

Box 1. Description of the linkage variables
The following variables were used to link laboratory tests in TIER.Net and the NHLS National HIV Cohort:
• Specimen barcode - Each blood specimen is assigned a unique NHLS specimen barcode.
• Test type - CD4 count or HIV viral load.
• Test facility - Facility where a specimen was taken (NHLS) or recorded (TIER.Net).
• Test taken date - "Taken date" in NHLS and "result date" in TIER.Net.
• Test result value - Numeric value of CD4 count or viral load test result. Some viral loads were classified as "lower than detectable limit"; these were coded as "0" for linkage.
• Year and month of birth - Extracted from the date of birth recorded for each NHLS specimen and each TIER.Net patient record.
• Biological sex - Recorded in TIER.Net and NHLS as "Male" or "Female".

Exclusions before linkage

Methods for Record Linkage
We applied four record linkage approaches using the laboratory test result information.
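As a rough illustration of how these approaches fit together, the sketch below implements barcode matching, exact matching, caliper matching (test dates within ± 5 days), and a sequential pass that applies them in that order to the records still unlinked. The record layout, field names, and the rule of skipping ambiguous candidates are assumptions of this sketch rather than details taken from the study.

```python
from collections import defaultdict
from datetime import date

# Hypothetical field names for the shared linking variables.
KEYS = ("test_type", "facility", "value", "yob", "mob", "sex")


def link(tier, nhls, caliper_days=0):
    """Pair records that agree on KEYS and whose dates differ by <= caliper_days.

    caliper_days=0 is exact matching; caliper_days=5 is the +/- 5 day caliper.
    A TIER.Net record with zero or several candidates stays unlinked
    (an assumption of this sketch).
    """
    by_key = defaultdict(list)
    for n in nhls:
        by_key[tuple(n[k] for k in KEYS)].append(n)
    links = {}
    for t in tier:
        candidates = [n for n in by_key.get(tuple(t[k] for k in KEYS), [])
                      if abs((t["date"] - n["date"]).days) <= caliper_days]
        if len(candidates) == 1:
            links[t["id"]] = candidates[0]["id"]
    return links


def link_barcode(tier, nhls):
    """Pair records sharing the same (barcode, test type) combination."""
    by_code = defaultdict(list)
    for n in nhls:
        if n["barcode"]:
            by_code[(n["barcode"], n["test_type"])].append(n)
    links = {}
    for t in tier:
        candidates = by_code.get((t["barcode"], t["test_type"]), [])
        if t["barcode"] and len(candidates) == 1:
            links[t["id"]] = candidates[0]["id"]
    return links


def sequential(tier, nhls):
    """Barcode, then exact, then caliper matching on what is still unlinked."""
    links, method = {}, {}
    steps = [("barcode", link_barcode),
             ("exact", lambda t, n: link(t, n, 0)),
             ("caliper", lambda t, n: link(t, n, 5))]
    for name, step in steps:
        new = step(tier, nhls)
        links.update(new)
        method.update({tier_id: name for tier_id in new})
        used = set(new.values())
        tier = [t for t in tier if t["id"] not in new]
        nhls = [n for n in nhls if n["id"] not in used]
    return links, method


def rec(id_, barcode, test_type, facility, value, yob, mob, sex, d):
    return dict(id=id_, barcode=barcode, test_type=test_type, facility=facility,
                value=value, yob=yob, mob=mob, sex=sex, date=d)


# Made-up records: T1 has a barcode, T2 agrees exactly, T3 is 3 days off,
# and T4 is 19 days off, which is outside the caliper.
tier = [rec("T1", "B1", "VL", "A", 0, 1980, 5, "F", date(2016, 3, 1)),
        rec("T2", None, "CD4", "A", 350, 1975, 2, "M", date(2017, 6, 10)),
        rec("T3", None, "CD4", "B", 500, 1990, 11, "F", date(2018, 1, 5)),
        rec("T4", None, "CD4", "A", 200, 1985, 7, "M", date(2019, 1, 1))]
nhls = [rec("N1", "B1", "VL", "A", 0, 1980, 5, "F", date(2016, 3, 1)),
        rec("N2", "B2", "CD4", "A", 350, 1975, 2, "M", date(2017, 6, 10)),
        rec("N3", "B3", "CD4", "B", 500, 1990, 11, "F", date(2018, 1, 2)),
        rec("N4", "B4", "CD4", "A", 200, 1985, 7, "M", date(2019, 1, 20))]
links, method = sequential(tier, nhls)
```

Removing linked records after each step keeps the later, looser rules from overriding a link made by a stricter one. Within a single step, two TIER.Net records could still point at the same NHLS record, so a production version would need an explicit tie-break rule.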

Evaluating performance of the linkage methods
We used the subset of laboratory tests with specimen barcodes in TIER.Net and NHLS to assess the performance of the exact and caliper matching linkage strategies. We assumed that barcodes were missing completely at random. Record pairs where the specimen barcode matched were defined as "true matches". Record pairs where the specimen barcode did not match were defined as "true non-matches". Performance was assessed across four dimensions: sensitivity, positive predictive value (PPV), linkage yield, and enrichment of the TIER.Net laboratory profile because of the linkage. Definitions are provided below:
a. Sensitivity. We computed the sensitivity as the proportion of "true matches" (i.e. barcode matches) that were identified by each linkage approach.
b. Positive predictive value (PPV). We computed PPV as the proportion of matches identified by each linkage approach that were a "true match".
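The two definitions above can be sketched as follows, using the exact-matching counts reported in this paper (419,658 true matches among 441,300 exact matches, out of 608,210 barcode matches). The function names are illustrative. The Clopper-Pearson routine shown here sums the binomial distribution directly, so it only suits small counts; at the study's scale, a beta-quantile routine from a statistics library would be the practical choice.

```python
import math


def sensitivity(true_matches_found, all_true_matches):
    """Proportion of gold-standard (barcode) matches recovered by a method."""
    return true_matches_found / all_true_matches


def ppv(true_matches_found, all_matches_found):
    """Proportion of a method's matches that are gold-standard matches."""
    return true_matches_found / all_matches_found


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), by direct summation (small n only)."""
    if k < 0:
        return 0.0
    if k >= n:
        return 1.0
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))


def clopper_pearson(x, n, alpha=0.05):
    """Exact binomial (Clopper-Pearson) interval for x successes in n trials."""
    def solve(k, target):
        # binom_cdf(k, n, p) decreases as p grows, so bisect on p.
        lo, hi = 0.0, 1.0
        for _ in range(100):
            mid = (lo + hi) / 2
            if binom_cdf(k, n, mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    lower = 0.0 if x == 0 else solve(x - 1, 1 - alpha / 2)
    upper = 1.0 if x == n else solve(x, alpha / 2)
    return lower, upper


sens_exact = sensitivity(419_658, 608_210)  # about 0.690
ppv_exact = ppv(419_658, 441_300)           # about 0.951
small_ci = clopper_pearson(7, 10)           # toy counts, for illustration only
```

The interval is found by inverting the two binomial tail probabilities, which is what makes it "exact" rather than a normal approximation.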
The approach for estimation of these parameters for the sequential linkage is provided in Text S1. For sensitivity and PPV, we estimated exact binomial 95% confidence intervals using the one-sample Clopper-Pearson method (37). Because there are a very large number of true non-matches, specificity and negative predictive value are nearly 100% and are not reported (21

Figure S1 shows the number of lab results in TIER.Net and NHLS in Ekurhuleni District over time. NHLS had more lab results than TIER.Net throughout the study period, with about 25% more than TIER.Net in 2015 and about 15% more in 2019.

Sensitivity and positive predictive value for exact and caliper matching vis-à-vis barcode "gold standard" matching using NHLS-TIER.Net validation data
We evaluated exact and caliper matching using barcodes as a gold standard. We identified 608,210 lab records with identical barcodes in TIER.Net and NHLS and considered these to be "true matches". All other pairs of laboratory records from TIER.Net and NHLS in which barcodes differed (and were non-missing) were considered "true non-matches". Exact matching yielded a total of 441,300 matches, of which 419,658 were "true matches", a sensitivity of 69.0% (95%CI: 68

Linkage performance and yield in the complete sample
Using our calculations of sensitivity and PPV from the "gold standard" dataset, we then estimated these parameters for the complete sample of eligible TIER.Net and NHLS labs data when linked using four methods: barcode matching, exact matching, caliper matching, and sequential linkage (Fig. 2; Table S2). We additionally assessed "yield", i.e. the proportion of lab results and patients in TIER.Net that were linked to NHLS by each method.
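A minimal sketch of the yield calculation at the lab and patient level follows. It assumes each TIER.Net lab result is available as a (lab ID, patient ID) pair and counts a patient as linked when at least one of their lab results is linked; the names and toy data are hypothetical.

```python
def linkage_yield(tier_labs, linked_lab_ids):
    """Return (lab-level yield, patient-level yield) for TIER.Net.

    tier_labs: list of (lab_id, patient_id) pairs, one per TIER.Net lab result.
    linked_lab_ids: set of TIER.Net lab IDs that were linked to NHLS.
    """
    linked_labs = sum(lab in linked_lab_ids for lab, _ in tier_labs)
    patients = {patient for _, patient in tier_labs}
    linked_patients = {patient for lab, patient in tier_labs
                       if lab in linked_lab_ids}
    return linked_labs / len(tier_labs), len(linked_patients) / len(patients)


# Toy data: patient P1 has two labs (one linked), P2 one linked, P3 none.
tier_labs = [("L1", "P1"), ("L2", "P1"), ("L3", "P2"), ("L4", "P3")]
lab_yield, patient_yield = linkage_yield(tier_labs, {"L1", "L3"})
```

Patient-level yield is at least as high as lab-level yield whenever patients have several lab results, since a single linked result is enough to connect the patient.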
Of all eligible TIER.Net labs, 608,210 labs matched on barcodes. Since all barcode matches were gold standard matches, PPV of this strategy was 100%, but sensitivity was estimated at just 95.8% and 59% of TIER.Net patients were matched.

Second, several facilities could not be linked between TIER.Net and NHLS and were excluded, highlighting the need for a national, regularly maintained crosswalk with NHLS and Department of Health facility identifiers. Third, the NHLS National HIV Cohort was created using a validated algorithm, and like all deduplication efforts contains some matching errors. Fourth, not all laboratory results in TIER.Net could be linked to NHLS. We were unable to accurately link 28% of CD4 count and viral load results in TIER.Net to NHLS. We cannot be sure why they were not linked; however, other studies have noted inconsistencies between information recorded in patient files and that captured in TIER.Net (12,13,34,44). Fifth, our approach requires the availability of data on the same patient characteristics (here, laboratory test results) to facilitate linkage, and would not be suitable for linking databases that do not contain some shared data points. Finally, the study was limited to one mostly urban district in South Africa, although the methods are likely generalizable more broadly.

CONCLUSION
Despite the exploratory nature of our study, the findings offer an exciting and readily available template for rapid integration of the NHLS National HIV Cohort and the TIER.Net patient management system for HIV research and policy evaluation in South Africa, without compromising patient privacy and confidentiality. Because 14% of TIER.Net patients with laboratory results (and all TIER.Net patients without laboratory results) remained unlinked, other methods, including the use of patient identifiers, should be used to create a comprehensive database for patient care and monitoring purposes.

Record linkage performance for each linkage strategy in the full dataset
Note: The figure compares the estimated performance of each linkage strategy with respect to sensitivity, positive predictive value (PPV), lab-level linkage yield, and patient-level linkage yield. Estimates were based on extrapolation from the barcode subsample, under the assumption that barcodes were missing completely at random (S1 Text). Overall, the sequential linkage approach outperformed the other approaches with the highest linkage yield at the lab (71.9%) and patient level (86.0%), high sensitivity (85.9%), and PPV (96.8%).