The added value of recent-infection testing in population-based HIV surveys

Background There is no clear consensus on how best to use the increasingly available data from large population-based surveys that ascertain HIV infection status. In particular, for the purpose of estimating HIV incidence, there is considerable scope for better elucidation of the benefit of adding ‘recent infection’ ascertainment, which adds substantial cost and complexity to surveys which are already costly and complex. Methods Using an epidemic/survey simulation tool developed for this and some closely related investigations, we explore the value added by ‘recent infection’ data from population surveys, in support of HIV incidence estimation. This builds directly on two companion pieces which have explored, independently, the use of the ‘synthetic cohort’ paradigm of Mahiane et al (analysing the age/time structure of prevalence, in conjunction with estimates of mortality) and the paradigm of Kassanjee et al (focusing on ‘recent infection’ data). marginal focused the early after sexual debut, taken a core sentinel group which efficient than attempts to cover all ages; contrast, when only cross


Introduction
A global HIV epidemic has been raging for four decades, and still there is no clear consensus on how best to estimate HIV incidence, i.e., the rate of new infections in a population. Estimating prevalence (the proportion of infected individuals in a population) is relatively straightforward, but not nearly as informative, especially about the recent impacts of interventions, policies, and changing social norms. Incidence estimation for chronic conditions is in general difficult, unlike for transient conditions, for which prevalence and incidence are simply related.
Large-scale population-level cross-sectional surveys that include HIV status determination, and in many cases also ascertainment of 'recent infection' as defined by objective laboratory procedures, have been conducted in many sub-Saharan countries, and have become a headline data source for epidemiological assessments at the national and supra-national regional level. Within the last two decades, variations of such surveys have been executed multiple times in numerous countries, leading to rich data sets tracking the prevalence of HIV infection and the 'prevalence' of 'recent infection' among confirmed HIV positive subjects, over time and by age (1)(2)(3)(4).
In two companion articles (5,6), we have systematically explored the optimal extraction of this age/time structure from population survey data. To recap:
• We deployed a comprehensive demography/epidemiology/survey simulation platform which we use again in the present work, and which is outlined in more detail separately (7).
• We proposed a generic approach to age/time regression in order to use the approach of Kassanjee et al. (8), which crucially relies on ascertainment of 'recent infection', to leverage analysis which is inspired by the simple relationship between incidence and prevalence for transient conditions.
• We demonstrated the applicability of a similar generic regression approach to the estimation of incidence by the approach of Mahiane et al. (9), which crucially relies on the estimation of a 'prevalence gradient', in conjunction with the estimation of a specifically defined 'excess mortality/attrition' for HIV positives, an approach which falls under the broad umbrella of 'synthetic cohort' analysis.
Increasingly, numerous countries, or subnational regions, have data which allow the application of both the Kassanjee and Mahiane frameworks. The question which then naturally arises is how best to combine the two methods, which provide nominally separate estimates that are nevertheless correlated in complex ways, as they both rely on the same underlying serostatus data, which always comprises the bulk of the data set. For the present analysis, we view this question through the lens of the benefit of the recency data, seen as an add-on to the main prevalence data set. This reflects the points that
• there is no sensible survey design that generates recency data but not prevalence data, and
• at the design stage, before data is available to analyse, one will want to be clear about the benefit of performing the recency ascertainment, which invariably implies substantial increases in both the cost and the complexity of surveys that are already major undertakings even without this requirement.
In outline, the present work has the following high-level components:
1. Simulating demonstrative epidemics, defined by incidence and mortality, leading to an emergent (age-, time- and time-since-infection-structured) population state.
2. Simulating realistic multiple cross-sectional surveys, where 'recent infection' is defined by a probability of testing 'recent' (on some algorithm) which depends explicitly on a function of time-since-infection in a way that is inspired by actual available tests of this kind.
3. Applying various smoothing algorithms to the survey data, in order to extract age and time specific estimates of prevalence of HIV, and prevalence of 'recent infection' amongst HIV positives.
5. Evaluating the relative merits of the various combinations of approaches, which we are able to do by comparing the estimates with the known incidence parameter values which were used in the simulations.
6. Proposing guidance on the use of, and value added by, 'recent infection' ascertainment (for the purpose of HIV incidence estimation).
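The recency model in component 2 can be made concrete with a small sketch. It assumes a purely hypothetical probability of testing 'recent', `prob_recent`, which starts at 1 at infection and decays with time since infection toward a small residual plateau, and it computes the MDRI as the area under that curve up to a cut-off T. The functional form, the parameter values and all function names are illustrative assumptions, not those of the simulation platform.

```python
import math
import random


def prob_recent(tau, scale=0.5, shape=3.0, plateau=0.01):
    """Hypothetical probability of testing 'recent' at tau years since infection.

    Weibull-shaped decay from 1 toward a small residual plateau; the form
    and the default values are illustrative assumptions only.
    """
    return plateau + (1.0 - plateau) * math.exp(-((tau / scale) ** shape))


def mdri(p_recent, cutoff=2.0, steps=20000):
    """Area under p_recent between 0 and the cut-off T (trapezoid rule), in years."""
    h = cutoff / steps
    total = 0.5 * (p_recent(0.0) + p_recent(cutoff))
    for i in range(1, steps):
        total += p_recent(i * h)
    return total * h


def simulate_recency_result(tau, rng):
    """Draw one simulated test result: True means 'recent'."""
    return rng.random() < prob_recent(tau)
```

For example, `mdri(prob_recent)` gives the MDRI implied by the assumed curve, and `simulate_recency_result(0.3, random.Random(1))` draws a single result for a subject infected 0.3 years before the survey.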

Methods
As noted, we are building on work reported in two companion pieces to this one, based primarily on the simulation of a number of cross-sectional surveys in a South Africa-like epidemic. We have already systematically investigated ways to adapt the methods of Mahiane et al. (9) and Kassanjee et al. (8) to estimate incidence from one or more cross-sectional surveys, and incidence differences when two or more cross-sectional surveys are available.
The functional forms of the two incidence estimators are

$$I_M = \frac{1}{1 - P}\,\frac{dP}{dt} + M\,P \tag{1}$$

$$I_K = \frac{P\,(R - \beta)}{(1 - P)\,(\Omega - \beta T)} \tag{2}$$

where I_M and I_K denote the Mahiane and Kassanjee estimators respectively, P is the prevalence of HIV, dP/dt is the gradient of the prevalence as seen from the point of view of a cohort of individuals of identical age, M is the excess mortality of HIV positive subjects, R is the prevalence of 'recent infection' (a.k.a. recency) among the HIV positive subjects, Ω is the Mean Duration of Recent Infection (MDRI), β is the false recent rate (FRR), and T is the time cut-off for being classified as recently infected without being 'falsely' recent. In our simulations, the MDRI, the FRR, and the differential mortality are known exactly, because they are explicitly specified, or emerge from (and are evaluated in) the simulation platform.
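As a minimal sketch, the two estimators can be written as plain functions. This assumes the standard published forms, I_M = (dP/dt)/(1 - P) + M·P for the synthetic cohort estimator and I_K = P(R - β)/((1 - P)(Ω - βT)) for the recency-based estimator, with MDRI and T in years so that incidence is per person-year. The function and argument names are illustrative, not taken from the authors' code.

```python
def mahiane_incidence(prevalence, prevalence_gradient, excess_mortality):
    """Synthetic-cohort form (assumed): I_M = (dP/dt) / (1 - P) + M * P."""
    if not 0.0 <= prevalence < 1.0:
        raise ValueError("prevalence must lie in [0, 1)")
    return prevalence_gradient / (1.0 - prevalence) + excess_mortality * prevalence


def kassanjee_incidence(prevalence, recency_prevalence, mdri, frr, cutoff):
    """Recency form (assumed): I_K = P (R - beta) / ((1 - P) (Omega - beta * T))."""
    if not 0.0 <= prevalence < 1.0:
        raise ValueError("prevalence must lie in [0, 1)")
    denominator = (1.0 - prevalence) * (mdri - frr * cutoff)
    if denominator <= 0.0:
        raise ValueError("MDRI must exceed FRR * cutoff")
    return prevalence * (recency_prevalence - frr) / denominator
```

With made-up inputs, `mahiane_incidence(0.2, 0.01, 0.05)` combines a prevalence of 20%, a cohort gradient of 1 percentage point per year and an excess mortality of 5% per year, while `kassanjee_incidence(0.2, 0.05, 0.5, 0.01, 2.0)` uses 5% recency among positives, an MDRI of half a year, an FRR of 1% and T = 2 years.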
To combine the information from the two estimators, we first define a general weighted average of the two, in which one estimator receives weight W and the other receives weight 1 - W. We find the optimal weight by differentiating the variance of this weighted average with respect to W and setting the derivative to zero. The resulting weight depends only on σ_M², the variance of I_M, on σ_K², the variance of I_K, and on Cov(I_M, I_K), the covariance of I_M and I_K. According to delta method analysis (10,11), Cov(I_M, I_K) can be approximated from the derivatives of the two estimators with respect to their inputs. The covariance can therefore be estimated either by the delta method, or by repeatedly simulating the survey (for example 10,000 times) or resampling from a particular data set (i.e. bootstrapping), for each iteration estimating I_M and I_K, and hence estimating Cov(I_M, I_K) from the iterates.
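The variance-minimising weight can be sketched as follows. For two estimators A and B, the variance of W·A + (1 - W)·B is W²σ_A² + (1 - W)²σ_B² + 2W(1 - W)Cov(A, B), which is minimised at W = (σ_B² - Cov(A, B)) / (σ_A² + σ_B² - 2Cov(A, B)). The labels A and B are deliberately generic, since which estimator carries the weight W is a convention assumed here. The function names are illustrative, and the replicate-based helper stands in for the simulation or bootstrap route described above.

```python
import statistics


def optimal_weight(var_a, var_b, cov_ab):
    """Weight on estimator A that minimises Var(W*A + (1 - W)*B)."""
    denominator = var_a + var_b - 2.0 * cov_ab
    if denominator <= 0.0:
        raise ValueError("Var(A - B) must be positive")
    # Not clamped: with strong positive covariance the result can leave [0, 1].
    return (var_b - cov_ab) / denominator


def combined_variance(weight, var_a, var_b, cov_ab):
    """Variance of weight*A + (1 - weight)*B."""
    other = 1.0 - weight
    return weight ** 2 * var_a + other ** 2 * var_b + 2.0 * weight * other * cov_ab


def sample_covariance(xs, ys):
    """Unbiased sample covariance of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (len(xs) - 1)


def combine_from_replicates(a_reps, b_reps):
    """Optimal weight and resulting variance from paired replicate estimates.

    a_reps[i] and b_reps[i] must come from the same simulated or resampled
    data set, otherwise the covariance between the estimators is lost.
    """
    var_a = statistics.variance(a_reps)
    var_b = statistics.variance(b_reps)
    cov_ab = sample_covariance(a_reps, b_reps)
    weight = optimal_weight(var_a, var_b, cov_ab)
    return weight, combined_variance(weight, var_a, var_b, cov_ab)
```

Pairing the replicates by iteration is the key design choice: both estimators are recomputed on the same resampled data, so their shared dependence on the serostatus data is carried into the covariance.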
Stable approaches to the smoothing of survey data to estimate the prevalence P, the prevalence of recency R, and crucially the gradient of prevalence, dP/dt, were discussed in depth in the two preceding companion papers. In short, a "one size fits most" approach can be summarised as follows:
• Use generalised linear models (GLM) to fit, in turn, the serostatus and the recency data, with either third or fourth order polynomials in age and time.
• Repeat the fitting procedure for each age and time for which incidence estimates are to be obtained, including data points by a simple proximity rule such as being within some (temporal) 'distance' of the age of interest.
• Use a logit or identity link function for fitting P and a logit or complementary log-log link function for R, with some age or age/time inclusion-distance rule.
• By default, we settled on a cubic polynomial with an inclusion distance of 6 years, with a logit link function for P and a complementary log-log link function for R.
Single cross-sectional surveys.
In addition to the usual semi-realistic 'South Africa-like' scenario, we also simulated a stable epidemic with a calendar-time invariant (but age dependent) incidence function, and also used a calendar-time invariant excess mortality (resembling a 'no treatment' scenario).
Two cross-sectional surveys.
Realistically, two cross-sectional surveys may utilise different 'recency' ascertainment tests, leading to different values for MDRI and FRR, as these two parameters are context specific (12)(13)(14). To avoid this distraction for the present purposes, surveys are simulated with the same recency test.

Incidence trends.
Incidence trends are a crucial indicator of whether interventions or emergent changes in habits and services are reducing the transmission of HIV. To investigate the prospects for estimating incidence trends from two cross-sectional surveys, we show how to obtain accurate and informative age-specific and age-range incidence difference estimates, and how sample size affects the precision of those estimates.
In cases where we attempt to estimate incidence difference from two cross sectional surveys, we estimate age specific incidence at the two survey dates using a shared estimate of in both estimates.
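A minimal sketch of how an incidence difference and its bootstrap range might be summarised is given below. It assumes paired replicate incidence estimates for the two dates, both computed from the same resampled data set in each iteration, and uses a simple percentile range. The helper names and the linear interpolation rule are illustrative choices, not necessarily those used for the figures.

```python
import math


def percentile(sorted_values, q):
    """Linear-interpolation percentile of an already sorted list, q in [0, 1]."""
    if not sorted_values:
        raise ValueError("no values supplied")
    position = q * (len(sorted_values) - 1)
    lower = int(math.floor(position))
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] * (1.0 - fraction) + sorted_values[upper] * fraction


def difference_range(first_reps, second_reps, level=0.95):
    """Percentile range of (second - first) over paired replicates."""
    if len(first_reps) != len(second_reps):
        raise ValueError("replicates must be paired")
    diffs = sorted(b - a for a, b in zip(first_reps, second_reps))
    tail = (1.0 - level) / 2.0
    return percentile(diffs, tail), percentile(diffs, 1.0 - tail)


def excludes_zero(interval):
    """True if the range lies entirely above or entirely below zero."""
    low, high = interval
    return low > 0.0 or high < 0.0
```

A difference estimate is informative in the sense used here when `excludes_zero(difference_range(...))` is true, i.e. when the bootstrap range does not straddle zero.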

Results/Discussion
Single cross-sectional survey
Figure 1 shows the incidence estimates from a single cross-sectional survey in a scenario in which there is no time dependence in any parameters or prevalences.
The key point appears to be that, even when the correct value of the excess mortality is provided, the highest and most age-dependent values of incidence are not estimated without significant bias by the Mahiane estimator, i.e., when the recent infection data are ignored. In practice, sample size (or sampling density) is likely to be smaller, and the bias shown here may be substantially swamped by poor precision.
Midpoint incidence estimates comparison (Mahiane, Kassanjee, and optimally weighted estimators)
While a logit link function for prevalence provides some stability by automatically constraining the prevalence to values between 0 and 1, it appears that an identity link function may offer superior fitting at various epidemic stages, so this should be explored in simulations adapted to mimic any context in which there has been a major investment in data of this kind.
These results also show the consistent trend that for young ages the Mahiane estimator provides most of the information about incidence, and for older ages the Kassanjee estimator provides most of the information.
Comparison of methods for estimating the optimal weight (delta method vs bootstrap)
We compared the two approaches to calculating the optimal weight (an analytical delta method versus the numerical bootstrap approach) and their effect on the weighted incidence estimate and the resulting standard errors. The results are shown in Table 1.
There is no substantial (indeed hardly any) difference between the estimates derived from the bootstrap approach and the analytical approach. The concordance of both the standard errors and the realised point estimates shows that, for computationally intense investigations, the delta method is a good proxy for estimating the standard error. On the other hand, once a major investment has been made in a complex survey, there is no obstacle to implementing an ultimately more robust bootstrap-based calculation.
Sensitivity of the standard error to W (midpoint)
Figure 4 shows the relative standard error of the weighted incidence estimate as a function of the normalised weight (W) for all 5 epidemic stages and selected ages. There is no sharply defined optimal weight. For example, the relative error at age 20 is almost flat for a range of W values (0 to 0.5), and hence any value between 0 and 0.5 yields much the same precision. The weighting scheme in early epidemics (1992.5) somewhat favours the Mahiane estimator; as the epidemic matures, and at older ages, the weighting scheme favours the Kassanjee estimator.
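One way to see why such curves are flat near their minimum is that the variance of any weighted average of two estimators is exactly quadratic in the weight. The short derivation below uses generic labels A and B for the two estimators and W* for the minimising weight. These symbols are illustrative, and the identity holds whichever estimator W is attached to.

```latex
\begin{aligned}
\sigma_W^2(W) &= W^2\sigma_A^2 + (1-W)^2\sigma_B^2 + 2W(1-W)\,\mathrm{Cov}(A,B) \\
              &= \sigma_W^2(W^{*}) + \mathrm{Var}(A-B)\,(W-W^{*})^2 ,
\qquad \mathrm{Var}(A-B) = \sigma_A^2 + \sigma_B^2 - 2\,\mathrm{Cov}(A,B).
\end{aligned}
```

Moving W away from W* therefore inflates the variance only to second order, and the inflation is small whenever Var(A - B) is small relative to the minimum variance.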
Incidence estimates at survey times
Figure 5 and Figure 6 show incidence estimates at the cross-sectional survey dates, derived from combining two cross-sectional surveys. The cross-sectional surveys are simulated from particular epidemic stages: either an increasing incidence (between 1994.5 and 1999.5) or a declining incidence (between 2010.5 and 2015.5). For comparison, we once more show the use of both an identity and a logit link function for fitting prevalence.
Incidence estimates at the survey dates are more precise than the midpoint incidence estimates in Figure 2 and Figure 3, probably because incidence is being estimated where the data points actually are. But this comes at the cost of accuracy: the incidence estimates are biased at the cross-sectional survey dates, due to the challenges of estimating the gradient of prevalence away from the mid time of the data set. Note: just one model is fitted simultaneously to both cross-sectional survey data sets (which is not the conventional use for recency data), and this approach is fundamentally designed to estimate the midpoint incidence and not the incidence at the cross-sectional survey dates.

Two surveys
Our attempts to estimate incidence trends/differences from two cross-sectional surveys, using all 3 approaches (the Mahiane, Kassanjee, and optimally weighted estimators), are shown in Figure 7. Apparently, estimating incidence differences using the Mahiane et al. approach requires luck, as the estimates are mostly biased even when they are precise, while incidence difference estimates from the Kassanjee estimator are unbiased, if not highly informative. It would seem that all the usable information is in the Kassanjee estimate, and a variance-minimising weighted estimate is not necessarily of any additional value, given the exposure to substantial bias.

Three surveys
Figure 8 shows incidence difference estimates based on 3 cross-sectional surveys, when incidence is steadily rising (1993, 1998, and 2003) and also when incidence is in steady decline (2005, 2010, and 2015). As expected, both the primary approaches (Mahiane and Kassanjee) yield accurate incidence difference estimates that closely track the incidence difference at all ages, though they are uninformative, in turn, at various ages. Once again, the additional effort of obtaining recency data mainly improves the estimates at older ages.
We can improve the precision of the incidence difference estimates by adding the post-hoc age averaging (see Figure 9) which we previously introduced in our companion piece (15), based on two cross-sectional surveys with recency ascertainment. Figure 9 compares the post-hoc age averaging for selected age groups to the age-specific incidence difference at the central age of each age bin. Generally, the incidence difference estimates for the selected age bins are accurate and, most importantly, the post-hoc averaging yields estimates that are significantly more informative for all methods, compared to the age-specific incidence difference estimates. Note that the age-weighted estimate is consistently distinguishable from 0, but the less sophisticated estimates are not.

Conclusion
In our preceding companion pieces, we explored the fine points to consider when estimating the prevalence, the prevalence gradient, and the prevalence of recency for use in each of the incidence estimators of Mahiane et al. (9) and Kassanjee et al. (8). This present work explores the benefits of combining the two estimators into a (variance) optimised weighted average. We have done this primarily from the point of view of asking what additional benefit is obtained from having the recency data.
With the additional insights gained from the present work, we now regard it as a straightforward matter to implement contextually adapted versions of a well-defined stable approach that consistently yields near-optimal extraction of HIV incidence estimates, based on whatever data is available from substantial population-based surveys of the kind which are being performed on a large scale in the heavily HIV affected countries of sub-Saharan Africa.
The question of whether to expend resources on adding recency ascertainment to large population-based surveys presents us with a difficult quandary. In general, reliable informative incidence estimation requires very large sample sizes (i.e., very high sampling densities across some age range) and works best when incidence is very high. This, coupled with the epidemiological/sociological importance of incidence among the young, suggests, as we have previously noted (6), that one consider focusing on this group as an informative and important sentinel population, rather than attempting to obtain incidence estimates for all ages, which may simply not be feasible. For these younger ages, recency ascertainment does not really improve single time point estimates. However, we are usually even more interested in incidence differences and trends than in single estimates, and we have seen that difference estimates based on just two survey rounds are not stable without recency data. By the time one has three rounds of major household surveys, and is in a position to obtain a robust incidence difference estimate without recency data, the better part of a decade will usually have elapsed from the first survey, and the incidence difference estimate will refer to a trend that was applicable to the epidemic some years in the past.
These considerations suggest that, before embarking on a multi-year, high-budget commitment to one or more major surveys with intent to estimate HIV incidence, it is worth investigating the specific situation by means of carefully adapted simulations in which various designs can be simulated, and the specific analysis for burning epidemiological questions can be explored. For example, one may consider surveying just young women (age 15-30, for example) and pursuing the headline estimate of mean incidence in the age group 20-25. Recent infection testing will not yield impressive incidence estimates from one survey round, but without recency testing there will be very little evidence on incidence changes even after two surveys, at which point the mean incidence estimate over this time will be largely driven by a Mahiane analysis.
There are other detailed loose ends we have not systematically investigated, such as:
• The impact of non-zero values for the false recent rate: While it is fashionable among some analysts to presume that FRR is always zero, this is not a safe bet, and there should always at least be a sensitivity analysis on this point.
• Multiple estimates of recency test properties: When there are multiple surveys which each perform some sort of recent infection testing, it is not obvious that the MDRI and FRR of the test or tests should be taken as having precisely the same value in each survey round. In practice, the best estimates of these test properties may be weakly or strongly correlated, depending on whether the difference is primarily one of choice of assay or epidemic context.
These kinds of additional considerations are not just minor points, and they may warrant very careful investigation in some variation of the analyses we have been describing. Fortunately, the simulation and analysis code we have developed for our present purposes, which is available upon request, can be flexibly and straightforwardly used to adapt the analyses we have presented to many finely specified alternative scenarios.

to the true (simulated) incidence. Each survey has a sample size of 4000 per 5-year age range with link functions logit for P and clog-log for R, using a cubic polynomial with an inclusion distance of 6. The input incidence function is time invariant and the excess mortality function simulates no treatment.
(1992, 1997), and (1997, 2002)).

Figure 2: Midpoint incidence estimates from pairs of simulated cross-sectional surveys
The estimates are based on the Kassanjee estimator, the Mahiane estimator, and the optimally weighted incidence estimators, as shown. Generous sample sizes of 4000 per 5-year age range were used, with either identity or logit link functions for P, as shown in the column labels. A clog-log link function was used throughout for R. All regressions used a cubic polynomial (in age, truncated at linear in time because there are only two time points across all observations in any given regression) and an observation inclusion distance of 6 years in the age direction.

The incidence estimates are derived from fitting one model to two cross-sectional surveys simulated in an epidemic stage with an incidence function that is rapidly increasing in time (1994.5, 1999.5). Each survey has a sample size of 4000 per 5-year age bin. The fit was done using a cubic polynomial with an inclusion distance of 6. Each column depicts the link function (logit vs identity) used for fitting P. R is fitted using a clog-log link function.

The incidence estimates are derived from fitting one model to two cross-sectional surveys simulated in an epidemic stage with an incidence function that is steadily decreasing in time (2010.5, 2015.5). Each survey has a sample size of 4000 per 5-year age bin. The fit was done using a cubic polynomial with an inclusion distance of 6. Each column depicts the link function (logit vs identity) used for fitting P. R is fitted using a clog-log link function.

The plot depicts the incidence difference estimates from two pairs of cross-sectional surveys, each pair depicting a particular epidemic stage. Each survey has a sample size of 24000 (4000 per 5-year age bin). Either a logit or an identity link function (columns) is used to estimate P, and a clog-log link function is used for R.

The plot shows the incidence difference estimates from two epidemic stages: rapid increase (1993, 1998, and 2003) and rapid decrease (2005, 2010, and 2015). The incidence estimates are calculated at the midpoint of each pair of consecutive surveys, and the difference between the two incidence estimates is then calculated. P and R were fitted using logit and clog-log link functions, respectively. The 95% range is estimated through 10000 bootstrap samples.

The plot shows the incidence difference estimates from two epidemic stages: rapid rise (1993, 1998, and 2003) and rapid decline (2005, 2010, and 2015). The incidence estimates are calculated at the midpoint of each pair of consecutive surveys, and the difference between the two incidence estimates is then calculated. P and R were fitted using logit and clog-log link functions, respectively. The 95% range is estimated through 10000 bootstrap samples.