Exploring the validity of routine individuated service data for antenatal HIV surveillance in the Western Cape

Background In the Western Cape, South Africa, public-sector individual-level routine data are consolidated from multiple sources through the Provincial Health Data Centre (PHDC). This enables the description of temporal changes in population-wide antenatal HIV seroprevalence. We evaluated the validity of these data compared to aggregated program data and population-wide sentinel antenatal HIV seroprevalence surveys for the Western Cape province. Methods We conducted a retrospective cohort analysis of all pregnancies identified in the PHDC from January 2011 to December 2020. Evidence of antenatal and HIV care from electronic platforms were linked using a unique patient identifier. HIV prevalence estimates were triangulated and compared with available survey estimates and aggregated programmatic data from registers as recorded in the District Health Information System. Provincial, district-level and age-group HIV prevalence estimates were compared between data systems using correlation coefficients, absolute differences and trend analysis. Results Of the 977800 pregnancies ascertained, PHDC HIV prevalence estimates from 2011–2013 were widely disparate from aggregate and survey data (due to incomplete electronic data), whereas from 2014 onwards, estimates were within the 95% confidence interval of survey estimates, and closely correlated to aggregate data estimates (r = 0.8; p = 0.01), with an average prevalence difference of 0.4%. PHDC data show a slow but steady increase in provincial HIV prevalence from 16.7% in 2015 to 18.6% in 2020. The highest HIV prevalence was in the Cape Metro district (20.3%) Prevalence estimates by age group were comparable between sentinel surveys and PHDC from 2015 onwards, with prevalence estimates stable over time among younger age-groups (15–24 years) but increased among older age-groups (> 34 years). Conclusions This study compares sentinel seroprevalence surveys with both register-based aggregate data and consolidated individuated administrative data. We show that in this setting linked individuated data may be reliably used for HIV surveillance and provide more granular estimates with greater efficiency than seroprevalence surveys and register-based aggregate data.


Introduction
Monitoring the progress of HIV programmes is reliant, in part, on robust surveillance.Whilst global HIV surveillance has evolved since the beginning of the epidemic(1, 2), survey-based approaches remain the mainstay in many settings.Since 1990, the National Antenatal Sentinel HIV Survey has been conducted among pregnant women at selected public health antenatal clinics in all nine provinces of South Africa annually or biennially (3)(4)(5).Until 2015, the surveys were limited to women presenting for their rst antenatal visit, but since then follow-up antenatal visits have also been included (3,6,7).Due to the national antenatal survey being underpowered to provide accurate estimates at district and sub-district level, the Western Cape Provincial Department of Health expanded the national survey to additional sites from 2001 to 2015 to generate more accurate subprovincial estimates (4).
In 2013, WHO and UNAIDS published Guiding principles on ethical issues in HIV surveillance indicating that unlinked anonymous testing should only be used where there is demonstrable inadequacy of program data for surveillance purposes(8).The WHO 2015 Consolidated guidelines on HIV testing services, and the WHO 2022 Consolidated guidelines on person-centred HIV strategic information, recommend moving towards using routine programmatic data for antenatal HIV surveillance(8-10).
To facilitate this transition in surveillance strategy, evaluations of routine data are required.A 2020 assessment of the national prevention-of-mother-to child programme concluded that South Africa was close to achieving the transition to routine data surveillance in relation to HIV testing, but that further evaluation of data completeness and accuracy was required (11,12).While these studies look speci cally at the validity of routine HIV testing data in comparison to laboratory-based HIV testing data from surveys, it should be emphasised that HIV patient management in South Africa is based on routine HIV point-of-care testing results.Reassuringly, available literature shows high antenatal HIV testing coverage in South Africa of up to 98% by 2011 (13).Use of routine data for surveillance purposes requires further validation with other surveillance data sources.
In the Western Cape province of South Africa there are two routine programmatic health information systems related to maternal and HIV care.Aggregated provincial data captured from service-based registers are housed in the District Health Information System (DHIS) (14).The HIV testing services register comprises 46 reporting elements (manual and calculated) for key HIV indicators.The Western Cape Provincial Government has additionally developed a Provincial Health Data Centre (PHDC) in which all individual-level routine data captured electronically in the province are consolidated on a single platform, leveraging the patient folder number as the unique patient identi er (14,15).This allows linkage of various information systems, including laboratory, pharmacy, and patient administration data, providing a rich source of individuated health information, in the absence of routine electronic patient health records.Within this environment, disease-speci c patient cascades (virtual cohorts), such as the HIV care cascade and maternity cascade, may be developed using speci c markers of care at different points (14).
Several studies in low-and middle-income countries have compared antenatal survey estimates with routine HIV data (8,(16)(17)(18)(19)(20)(21).A 2013 study in the Western Cape showed comparable estimates between antenatal survey data and aggregated antenatal HIV data, but also reported disparities at the sub-district level(4).To date there have been no studies in the South African context comparing antenatal survey HIV estimates with individuated routine data estimates.We sought to evaluate the validity of routine HIV prevalence estimates in pregnant women in the Western Cape Province of South Africa as compared to sentinel surveillance.

Methods
The study was set in the Western Cape province of South Africa, which is comprised of one metropolitan district, Cape Metro and ve less urbanised districts (Overberg, Garden Route, Central Karoo, West Coast and Cape Winelands).We compared crosssectional HIV antenatal survey estimates to those derived from aggregated program data and individuated program data 2011 to 2020.To evaluate the validity of routine HIV prevalence estimates in pregnant women we analysed four datasets; these are described below.

National antenatal survey data
National sentinel antenatal survey HIV prevalence estimates for the Western Cape province were obtained from the South African National Department of Health.District-level and age-disaggregated estimates were provided on all pregnant women attending their rst antenatal visit in a public health facility in the Western Cape during a 6-week survey period in the years 2011-2015.Post 2015, the national antenatal survey was conducted biennially.In years 2015, 2017 and 2019, the national survey data included pregnant women attending rst antenatal visits or follow-up antenatal visits in a public health facility during a 6-week period.However, to promote consistency with earlier surveys, we limit our estimates to rst antenatal visit.Blood specimens of survey participants were tested for HIV at a central laboratory using HIV ELISA tests.

Expanded provincial antenatal survey data
Provincial sentinel antenatal survey data 2011-2015 (subsequently discontinued), incorporating larger, proportionally weighted sample sizes, were obtained from the Western Cape Department of Health.To attain provincial estimates, sub-district data were weighted using the proportional distribution of antenatal rst visits in the prior year.

DHIS (Routine aggregated HIV program data)
Aggregated program data included all women attending their rst antenatal visit at primary health care facilities in the Western Cape 2011-2020.All women, regardless of HIV status, are offered a point-of-care HIV test at rst antenatal visit.Results of these tests are captured in facility-based registers, initially prevention-of-mother-to-child transmission (PMTCT) register and later primary health care services register.As PMTCT register data were limited from 2011-2013, provincial estimates from 2011 and 2012 were obtained from a study comparing routine aggregated data to sentinel surveys (4).From 2014 to 2020, the proportion of pregnant women at rst visit with evidence of prior HIV diagnosis or new diagnosis, derived from the primary health care services register, was used to calculate prevalence estimates."HIV positive PMTCT initial test" and "Known HIV positive client" elements were combined as the numerator, with total "Antenatal rst visit" as the denominator.Routine HIV testing is based on a point-of-care HIV testing algorithm conducted by an HIV counsellor or nurse, where only discordant test results are con rmed by central laboratory-based ELISA testing.Known HIV positive status is captured by an HIV counsellor or nurse based on self-reported HIV and medical records con rming HIV diagnosis.

PHDC (Routine individuated HIV program data)
A retrospective cohort was derived from the PHDC which included de-identi ed linked data of all pregnant women attending public health facilities across the province 2011-2020.The cohort was enumerated using the PHDC maternity cascade which links electronic records of all patients with administrative or laboratory evidence indicative of pregnancy.Pregnancies inferred with high con dence were included, as they have at least one high con dence evidence such as a rhesus antibody test (conducted routinely at rst antenatal visit), pregnancy test, International Classi cation of Diseases (ICD) Tenth Revision code indicating pregnancy or pregnancy outcome, maternal discharge summary or repeat moderate con dence evidences such as antenatal visits.Since information speci c to the rst visit is not routinely captured in all public health facilities, it was not possible to distinguish between rst and follow-up antenatal visits, however all pregnancies were only captured once.District and sub-district for pregnancy was determined by the geographic location of the facility of rst pregnancy evidence.The pregnancy period was estimated using the pregnancy outcome date and/or any available evidence on gestational age, with year allocated according to the date of rst record of pregnancy.Since HIV diagnosis is based on point-of-care testing, these results are not digitised and therefore not available to the PHDC.In the absence of these testing data, administrative, laboratory and pharmacy evidence of HIV diagnosis before or during the estimated pregnancy period was used to determine antenatal HIV status of women in the maternity cohort.The proportion of pregnancies with electronic evidence of HIV diagnosis prior to pregnancy or during pregnancy amongst all pregnant women was used to calculate antenatal HIV prevalence estimates.

Analysis
Data were analysed using Microsoft Excel and Stata 17 (Stata Corporation, College Station, Texas, USA).Measures of central tendency and dispersion were used to describe continuous variables, depending on distribution.Categorical variables were described using proportions and 95% con dence intervals, using the normal approximation to the binomial distribution.Descriptive characteristics of the PHDC cohort (2014-2020) were validated with 2014 provincial antenatal survey data.
Prevalence estimates from provincial surveys, and aggregated and individuated program data, were calculated for comparison with national survey estimates, serving as the gold standard in this comparative analysis.Data were analysed by year at provincial and district level for comparison.
Provincial estimates from all datasets were further compared with provincial antenatal HIV prevalence estimates from the Thembisa mathematical model (22).This is an integrated demographic and HIV model for South Africa, calibrated to a number of HIV data sources including antenatal HIV prevalence surveys (22).Quantitative comparisons were analysed using correlation coe cients and average percentage differences.Individuated data from 2015 onwards were further disaggregated by district and age for prevalence estimates.Provincial age-group estimates were compared between PHDC and the national antenatal survey from 2015.There were no comparators available for district-level age group estimates as the national antenatal survey is underpowered for this purpose.Data were categorised using age categories routinely used in antenatal survey reporting.

Ethical considerations
The study was approved by the University of Cape Town Human Research Ethics Committee (HREC 083/2021) and the Western Cape Provincial Health Research Committee.All antenatal HIV sentinel survey and DHIS data were received as aggregates.Data from the PHDC were de-identi ed before release for the study according to the Western Cape Department of Health Data Access Policy Guidelines.

Results
From 1 January 2011 to 31 December 2020, 977 800 and 989 568 pregnancies were enumerated by the PHDC and DHIS, respectively.Table 1 presents study population size per year compared with sample sizes of both provincial and national antenatal surveys.As compared to the DHIS, fewer pregnancies were enumerated by PHDC prior to 2015.
Amongst women presenting for their rst antenatal visit as recorded in the DHIS aggregated data, the percentage known to be living with HIV increased from 9.5% (95% CI 9.4-9.7%) in 2014 to 15.0% (95% CI 14.8-15.2%) in 2020.Over the same period, the percentage of women accepting an HIV test at rst antenatal visit declined from 90.4% (95% CI 90.2-90.6%) to 81.0% (95% CI 80.8-81.2) (Table 1).The characteristics of the women participating in the PHDC cohort (2014-2020) and the 2014 antenatal survey (unweighted) are shown in Table 2.As seen in Table 1 and Fig. 1, PHDC HIV prevalence estimates from 2014 onwards are closely aligned to national, provincial and DHIS data.*Gravidity estimates (number of times a woman has been pregnant, including current pregnancy) in the PHDC are not reliable since historic data are incomplete.Electronic evidence of current and prior pregnancy is used as a proxy to provide a full description of the cohort.

**Unweighted
The average difference in antenatal HIV prevalence between the PHDC and other datasets are shown in Table 3.A positive correlation was observed between PHDC and DHIS (r = 0.8).HIV prevalence estimates remained stable over time among younger age-groups (aged 15-29 years) but increased among older age-groups (> 34 years).Prevalence estimates between PHDC and national antenatal survey data were closely aligned by age group (Fig. 2).

Discussion
This is the rst study to compare sentinel HIV seroprevalence surveys with both routine aggregated and individuated data.Our results show that from 2015 onwards, the PHDC provides a reliable source of individuated data for accurate and timely antenatal HIV surveillance at provincial, district and age-group levels.These are essential for a responsive health system to plan and evaluate programmes.
From 2015 onwards, the PHDC dataset enumerated more pregnancies than re ected in routine aggregate antenatal rst visit registers (DHIS), supporting the completeness of these individuated data.The lower enumeration of pregnancies prior to 2015 is most likely due to limited availability of electronic data for linkage in earlier years, when electronic data systems were less widely established.As more public health facilities began using routine electronic data systems, patients with any contact with public health facilities would have some electronic record enabling inclusion in the PHDC.Given that the routine individuated data represent the entire population utilising public health services, these data are more representative of the population than survey data.It is also likely that linked individuated data are less prone to the biases of routine aggregated data such as duplication and lack of completeness, both numerically and within available records.A single unique identi er across the province prevents duplication of records where patients attend different facilities.Linkage of different electronic systems using a unique identi er also allows more opportunities for ascertainment of pregnancy and HIV status than xed variables used in aggregate data.
Comparison of descriptive characteristics between PHDC cohort and survey cohort enabled validation of the PHDC cohort prior to estimating HIV prevalence.The median age of women in the PHDC cohort (26.9 years) was similar to that in the antenatal survey population (3,6), supporting the close alignment in demographic pro le of the two cohorts.PHDC, however, had a higher proportion of women with no prior evidence of pregnancy (60%) when compared to recorded gravidity in the survey (approximately 30%) (3,6).This is likely due to less well-established electronic systems in earlier years for prior pregnancy ascertainment.Furthermore, the PHDC cohort had a higher proportion of patients from the Cape Metro district and lower proportion from the rural districts.This is also likely due to increased PHDC coverage in the Cape Metro, particularly in earlier years as electronic systems were better established in urban areas.The national surveys may further include oversampling of rural districts to generate accurate district-level estimates.We would suggest whole population surveillance negates a need for oversampling.
HIV prevalence estimates in both national and provincial antenatal surveys remained consistent between 2011 and 2015, with provincial surveys providing more precise estimates due to the larger sample size.From 2014 onwards, PHDC estimates are closely aligned to the survey.Given the wide con dence intervals for survey estimates, trends are di cult to infer.DHIS HIV prevalence estimates for 2011 and 2012 were consistent with survey estimates, as shown in a comparative study (4).This study, however, used HIV-service speci c denominators from DHIS viz.PMTCT initial HIV test acceptance or refusal.From 2014 onwards, refusal of PMTCT initial test was no longer recorded hence rst antenatal visit was used as the denominator, resulting in lower estimates than in earlier years.DHIS data, however, showed an increasing proportion of women over time to be presenting for their rst antenatal visit and known to be living with HIV.This increase corresponded with a decreasing proportion of women accepting HIV testing at rst antenatal visit.The lower proportion of test acceptance is most likely due to increasing awareness of HIV positive status, in keeping with a recently published modelling study (23).Additionally, some women with known HIV positive status may still retest in antenatal settings, as seen in a recent study in the Western Cape (24).The change in DHIS elements likely represents the changing HIV context with greater awareness of HIV status and wider rollout of antiretroviral therapy (ART) over time (23).Variability presented in DHIS estimates is likely, in part, due to challenges in maintaining routine aggregate data systems where contextual factors, such as staff turnover, high workload, and social unrest impact data collection and consolidation from registers.
PHDC HIV prevalence estimates 2011 to 2013 were lower than survey estimates, most likely due to incomplete electronic data.From 2014 onwards, estimates were better aligned and more consistent over time.As PHDC estimates do not distinguish between pregnancies with or without antenatal care, prevalence was expected to be slightly higher than that estimated using DHIS and survey data, as these data are limited to pregnancies with antenatal care.Antenatal rst visit coverage is however high in the Western Cape at 94% and therefore the contribution of pregnancies without prior antenatal care to HIV prevalence is small in the PHDC estimates (13,25,26).Reassuringly, PHDC estimates remained within the 95% con dence intervals of the national survey, and PHDC and DHIS estimates showed a positive correlation and low average difference.It should be noted, however, that average difference may mask the volatility of prevalence trends over time.Low average difference between survey and PHDC estimates suggest the PHDC may be reliably used to estimate antenatal HIV prevalence at provincial level.PHDC estimates over time were also similar to the Thembisa model estimates.Since the Thembisa model includes both private and public sector data, lower estimates than the PHDC were expected (re ecting lower HIV prevalence among private patients) (27).
At a more granular level, from 2015 onwards, estimates from PHDC were closely aligned to both DHIS and survey at district level, with disparities noted more in sparsely populated rural districts like Overberg.These disparities may re ect survey undersampling in smaller rural districts as well as differences brought about by migrant populations such as workers(28

Conclusion
Our study demonstrates the validity of routine individuated data for timely and e cient HIV antenatal surveillance, without the additional cost and logistical complexity brought about by regular surveys and with fewer biases compared to routine aggregated data.We highlight the added utility of routine individuated data in providing more granular estimates than sentinel surveillance at district and sub-district level, thereby facilitating more detailed and timely population-level epidemiological trend analysis.While provincial antenatal HIV prevalence trends have increased slowly over time, notable differences in district level trends require further investigation.Strengthening of routine individuated data systems will create an actionable platform to support service delivery and allow richer, more e cient, less costly and more timeous HIV surveillance.

Supplementary Files
This is a list supplementary les associated with this preprint.Click to download.

Figure1S.tif
Figure2S.tif estimates are shown in Supplementary Fig. 1.Prevalence estimates between DHIS and PHDC were closely aligned from 2015 to 2020.PHDC and national antenatal survey district estimates were also closely aligned in 2017 and 2019 in all districts except Overberg, with an average percentage point difference of -4.4 from 2015 to 2020.

Table 1 :
Antenatal HIV prevalence estimates by dataset †2015 national survey included additional Western Cape data and was limited to women presenting for rst antenatal visit.‡2011 and 2012 DHIS estimates denominator was total number of women refusing or accepting PMTCT initial tests whereas later estimates used the denominator of total antenatal visits *Prevalence estimate is among rst antenatal visit only.2017 reported prevalence including both rst and follow-up attendees is 15.9% (95% CI 14.2 -17.8).*Prevalence estimate is among rst antenatal visits only.2019 reported prevalence including both rst and follow-up attendees is 17.9% (95% CI 16.2 -19.7).§95% Con dence intervals are not reported for routine data estimates as the whole population is included precluding the need for sampling **Gravidity estimates (number of times a woman has been pregnant, including current pregnancy) in the PHDC are not reliable since historic data are incomplete.Electronic evidence of current and prior pregnancy is used as a proxy to provide a full description of the cohort.**Unweighted Since DHIS cannot provide age-disaggregated estimates, comparisons with PHDC and survey could not be made.Furthermore, in recent years, estimates at sub-district level are only possible with the PHDC data as the national survey is underpowered at sub-district level.This again highlights the advantages of linked individuated data over both survey data and aggregate routine data in providing granular estimates, not limited to predetermined indicators.LimitationsA comparative study of this nature is subject to several limitations.Firstly, routine data were validated against sentinel surveillance data, with national surveys serving as the most accurate HIV prevalence estimates or "gold standard".These survey estimates are however based on smaller sample sizes, over a limited time period, and underpowered for granular estimates which may impact accuracy.PHDC data were less reliable before 2014 due to incomplete electronic data in earlier years.Estimates from each dataset are derived using differing numerators and denominators with differing levels of quality and completeness.Furthermore, both individuated and aggregated routine data are subject to various pitfalls such as capturing errors, administrative errors and consolidation errors which may impact quality and completeness of these data.Pregnancy ascertainment may differ between districts and sub-districts due to differing use of electronic information systems.Since detailed patient characteristics such as socio-economic status and education level are not captured routinely, in-depth comparisons with survey data were not possible.Contextual factors impacting on observed trends in routine data are diverse and require further investigation -these include clerical changes in capturing approaches, widespread impact of the COVID-19 pandemic on service utilisation and sta ng, migration patterns etc. Survey and routine data sources evaluated in this study include only patients utilising public health care services, excluding those in the private sector and those without access to public health care.Lastly, while the Western Cape province has established individuated data systems, most other provinces in South Africa are still reliant on aggregated register-based data.The results of this study are therefore not representative of the whole country and highlight the need and potential to strengthen individuated routine information systems for improved surveillance.
). Age-group HIV prevalence estimates from the PHDC were closely aligned to the national survey from 2015 onwards, showing consistent HIV prevalence in all age groups over a 5-year period, with higher estimates in older age groups.Higher prevalence in older age groups is expected due to a combination of factors, including increased cumulative incidence with age, and use of ART extending life expectancy.
The study was approved by the University of Cape Town Human Research Ethics Committee (HREC 083/2021) and the Western Cape Provincial Health Research Committee.All antenatal HIV survey data, DHIS and Thembisa data were provided in aggregated form.Verbal consent was obtained from survey participants as per survey protocol, however only aggregated survey data were provided for this study.
DeclarationsEthics approval and consent to participate Figures Page 14/15