A systematic review of endpoint definitions in late phase tuberculosis therapeutic trials


Background. Safe, more efficacious treatments are needed to address the considerable morbidity and mortality associated with tuberculosis (TB). However, the current practice in TB therapeutics trials is to use composite binary outcomes, which in the absence of standardization may inflate false positive and negative errors in evaluating regimens. The lack of standardization of outcomes is a barrier to the identification of highly efficacious regimens and the introduction of innovative methodologies.
Methods. We conducted a systematic review of trials designed to advance new TB drugs or regimens for regulatory approval and inform practice guidelines. Trials were primarily identified from the WHO International Clinical Trial Registry Platform (ICTRP). Only trials that collected post-treatment follow-up data and enrolled at least 100 patients were included. Protocols and Statistical Analysis Plans (SAPs) for eligible trials from 1995 to the present were obtained from trial investigators. Details of outcome data, both explicit and implied, were abstracted and organized into three broad categories: Favorable, Unfavorable, and Not Assessable. Within these categories, individual trial definitions were recorded and collated, and areas of broad consensus and disagreement were identified and described.
Results. From 2205 TB-related trials, 51 were selected for protocol and SAP review, from which 31 were both eligible and had accessible documentation. Within the three designated categories, we found broad consensus in the definitions of Favorable and Unfavorable outcomes, although specific details were not always provided, and when explicitly addressed, were heterogeneous. Favorable outcomes were handled the most consistently but were widely variable with respect to specification. In some cases, the same events were defined differently by different protocols, particularly in distinguishing Unfavorable from Not Assessable events. Death was often interpreted conditional on cause. Patients who did not complete the study because of withdrawal or loss to follow-up presented a particular challenge to consistent interpretation and analytic treatment of outcomes.
Conclusions. In a review of 31 clinical trials, we found that outcome definitions were heterogeneous, highlighting the need to establish clearer specification and a move towards universal standardization of outcomes across TB trials. The ICH E9 (R1) addendum provides guidelines for undertaking and achieving this goal.
Registration. PROSPERO 2020 CRD42020197993


Introduction
Tuberculosis (TB) kills more people globally than any other single pathogen (1), with mortality and morbidity likely to increase as a result of the COVID-19 pandemic and the many ensuing challenges posed to national TB control programs (2). New shorter, safer and more efficacious treatments are urgently needed (3). In response to this need, more than a dozen new compounds are in early or middle clinical development (https://www.newtbdrugs.org/pipeline/clinical) with numerous late-phase randomized controlled trials expected in the near future, conducted either by individual pharmaceutical companies, or as part of publicly or philanthropically funded networks.
Most recent and ongoing late-phase TB therapeutics trials have used a composite binary outcome that combines bacteriological failure and relapse, death, treatment changes, and loss to follow-up as the primary efficacy outcome. Multiple analysis populations are usually proposed as co-primary. These include an intention-to-treat (ITT) analysis population including all patients randomized, classifying as unfavorable any participants with substantial missing data; a modified ITT (mITT) analysis population excluding some losses to follow-up from the analysis; and a per protocol (PP) analysis population excluding participants who had a protocol violation or did not complete a sufficient proportion of treatment. This approach has a number of limitations:
Not standardized. Outcome definitions are not standardized across phase III TB treatment trials. This leads to considerable challenges in combining data, interpreting results, assessing comparative efficacy, implementing predictive modelling, and conducting necessary meta-analyses (as exemplified in the TB-ReFLECT project (4)).
Outdated. The emphasis on simple, unadjusted per protocol analyses (not considering causal inference methods (5)) and even modified intention-to-treat analyses with post-randomization exclusions is at odds with best practice in other disease areas (5, 6) and regulatory guidance (7). The draft version of the FDA guidance document for non-inferiority trials (2010) initially accommodated an "as-treated" analysis, but this was removed in the final guidance document (2016) (7).
May inflate Type I and II errors. Classifying the outcome of participants lost to follow-up as unfavorable (i.e., defining "missing" as "failure") is likely to result in conservative estimates in superiority trials by diluting any treatment effect (and is therefore often favored by regulators). This is not necessarily conservative in a non-inferiority trial, can inflate type I and type II errors, and can also result in misleading decisions in the context of adaptive platform trial designs.
A barrier to identifying highly efficacious regimens. Including events that are less likely to be related to treatment (including loss to follow-up and non-TB mortality) in a composite outcome increases variability in treatment effect estimates and therefore necessitates an increased sample size. This added "noise" also makes it challenging to identify interventions (like stratified medicine approaches (4)) that may result in very high cure rates (97%-100%) without requiring prohibitively large sample sizes (8).
At odds with policy makers and guideline developers. WHO guidelines generally rely on WHO programmatic outcome definitions (9) when considering evidence (the 2018 DR-TB guidelines are a case in point (10)). The "catch-all" nature of the composite outcome currently used in phase III trials is likely to have contributed to this disconnect between trials and the approach taken for guidelines.
Mixes efficacy and safety events. Including treatment changes due to adverse events during treatment in the composite outcome conflates safety and tolerability with efficacy.
Impedes progress in prediction modelling. A phase III outcome defined by composite events does not allow for efficient and predictive linkage with phase IIB endpoints, such as time to culture conversion, that are essential for bridging the gap between phase II and phase III trials (11), and that will be increasingly important as new biomarkers of TB treatment response are identified (12). Similarly, translational modelling across species (non-human primates, mice, rabbits) is limited by discordance in outcomes, which enables translational errors and suboptimal decision-making about which regimens to advance in clinical development.
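Two of the limitations listed above, the dilution produced by counting missing outcomes as failures and the larger sample size needed when the favorable rate is lower, can be illustrated numerically. The sketch below is a minimal illustration and not an analysis from any reviewed trial: the rates, loss fraction, margin, one-sided alpha and power are hypothetical, loss to follow-up is assumed to be unrelated to arm or outcome, and the sample size uses a standard normal approximation for two proportions with equal true rates.

```python
from math import ceil
from statistics import NormalDist


def observed_rate(true_rate: float, lost_fraction: float) -> float:
    """Favorable rate observed when every participant lost to follow-up
    is counted as unfavorable (assumes loss is unrelated to arm or outcome)."""
    return true_rate * (1.0 - lost_fraction)


def n_per_arm(favorable_rate: float, margin: float,
              alpha: float = 0.025, power: float = 0.90) -> int:
    """Approximate per-arm sample size for a non-inferiority comparison of
    two proportions, assuming both arms share the same true favorable rate."""
    z = NormalDist().inv_cdf(1.0 - alpha) + NormalDist().inv_cdf(power)
    variance = 2.0 * favorable_rate * (1.0 - favorable_rate)
    return ceil(z ** 2 * variance / margin ** 2)


# Hypothetical values, for illustration only.
true_control, true_experimental, lost = 0.90, 0.80, 0.20
true_diff = true_experimental - true_control
observed_diff = (observed_rate(true_experimental, lost)
                 - observed_rate(true_control, lost))
print(f"true difference {true_diff:+.3f}, observed difference {observed_diff:+.3f}")

# A lower favorable rate (more "noise" events) needs more participants.
for rate in (0.85, 0.95):
    print(f"favorable rate {rate:.2f}: about {n_per_arm(rate, margin=0.06)} per arm")
```

Under these assumed values the observed difference is smaller in magnitude than the true difference, which is conservative for a superiority comparison but can make a truly inferior regimen look closer to control in a non-inferiority comparison, and the lower favorable rate requires the larger per-arm sample.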
Furthermore, regulatory guidance is changing with the ICH E9 (R1) addendum on Estimands and Sensitivity Analyses (finalized November 2019), which formalizes a new approach to specifying trial objectives, endpoints and analysis populations (collectively called the Estimand).
With new late-phase trials expected on the horizon, it is therefore vital to carefully consider and refine the phase III primary efficacy outcome, making use of the language proposed in the ICH E9 (R1) addendum. The objective of this systematic review was to first catalogue phase III outcome definitions (including analysis populations and primary objectives) from recent phase III trials for new regimens for drug-susceptible (DS) and drug-resistant (DR) tuberculosis, and then to conduct a thematic analysis on these outcomes to identify areas of consensus and disagreement. The overarching goal of this work is to use these results to develop standardized consensus estimands for phase IIC and III TB therapeutics trials.

Methods
The protocol for the systematic review was prospectively registered on the PROSPERO registry (PROSPERO 2020 CRD42020197993) (13) and is provided as an online supplement, along with the PRISMA checklist (14). Briefly, this systematic review sought to identify trials that have been designed to advance a new drug or regimen for regulatory approval and therefore inform and impact practice guidelines. The focus was on phase III and other late-phase randomized controlled trials (RCTs), or non-randomized trials of new drugs intended specifically for regulatory approval. Trials of treatment for latent TB, prevention of TB, diagnosis of TB, extrapulmonary TB, adjuvant nutritional supplements or immune therapies, ART initiation among TB patients, and trials of TB vaccines and programmatic interventions looking at adherence interventions (DOT or mHealth initiatives) were excluded, as endpoints in these trials are defined differently. Trials that did not collect outcome data on post-treatment follow-up (for relapse) or that enrolled fewer than 100 patients were excluded since these were clearly not designed to change guidelines and practice.
The WHO International Clinical Trial Registry Platform (ICTRP) was the primary database searched to identify relevant trials. To increase the likelihood that no trials were missed, we also contacted experts in the field of TB trials to identify other trials and reviewed the excellent list of DR-TB clinical trials maintained by RESIST-TB (www.resisttb.org).
Two individuals (PPJP and JJL) independently reviewed the list of trials identified from the search strategy, using titles and other fields from the ICTRP to determine whether they met the inclusion criteria. Investigators or sponsor representatives of the final selected studies were contacted to access the study statistical analysis plans (SAPs) and study protocols; these were downloaded from the public domain when available. Two individuals (NKH and JJL) reviewed all protocols and SAPs and abstracted relevant information.
Qualitative data from primary endpoint definitions of different studies were analyzed using thematic analysis in the five stages outlined by Braun and Clarke (15). Qualitative analyses and summaries were done by NKH. The final draft of the manuscript was circulated to PIs of all completed trials for their comments, approval and edits. Our objective was to describe areas of consensus and disagreement as drawn from protocols and SAPs across trials, rather than to critique individual trials. For this reason, we do not discuss nuances in definitions in specific trials, but rather aim to provide a summary of broad trends in outcome definitions and analyses used in recent TB treatment trials.

Results
Due to heavy traffic generated by the COVID-19 pandemic during early 2020 and limited ability to use the on-line search portal for the WHO ICTRP, we downloaded the full ICTRP database (3.5GB, 19 May 2020) and used it for this systematic analysis. From 632,787 clinical trials registered, we identified 2205 with a condition containing 'tb' or 'tubercul' and selected 510 for independent registry review by two reviewers.
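The keyword screen described above can be sketched as a case-insensitive substring match on each record's condition field. This is only an illustration of the general idea: the record structure, the field names `trial_id` and `condition`, and the example records are hypothetical and do not reflect the actual layout of the registry download or the exact matching procedure used.

```python
KEYWORDS = ("tb", "tubercul")  # search terms reported in the text


def mentions_tb(condition: str) -> bool:
    """Case-insensitive substring match on a registry condition field."""
    text = condition.lower()
    return any(keyword in text for keyword in KEYWORDS)


# Hypothetical registry records; the field names are assumptions.
records = [
    {"trial_id": "A", "condition": "Pulmonary Tuberculosis"},
    {"trial_id": "B", "condition": "MDR-TB"},
    {"trial_id": "C", "condition": "Type 2 diabetes"},
    {"trial_id": "D", "condition": "Cholera outbreak response"},
]

selected = [r["trial_id"] for r in records if mentions_tb(r["condition"])]
print(selected)
```

A broad screen of this kind also matches unrelated records (here, "outbreak" contains "tb"), so it can only serve as a first pass before manual review.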
All registry information was available in English. From these, we identified 51 trials that were highly likely to be relevant and eligible for inclusion (see Figure 1 for PRISMA flow diagram (14)). We then contacted Principal Investigators of the selected trials to request the most current versions of their protocols and, when possible, SAPs. We received protocols from 31 studies, and SAPs from 18 (58%). Many trials were listed on more than one trial registry; the majority (27, 87%) were listed at least on clinicaltrials.gov; two of the studies were only listed on the ISRCTN registry (isrctn.com), and two trials were only listed on the Clinical Trials Registry of India (ctri.nic.in). Registration across all trials was finalized between the years of 2001 and 2020 (although in some early cases, trials were not registered until after completion; the earliest trial began enrolling in 1995 but was not registered until 2001), with 21 (68%) of the trials registered during or after 2010. All protocols were available in English. Twenty-six of the trials (84%) were phase III, either with (n=29) or without (n=2) internal controls; one trial was described as phase IIB/III, two were listed as phase IIC, and two as phase IV (see Table 1).
Ten of the trials targeted patients with drug-resistant TB (DR-TB) and the remaining 21 trials enrolled patients whose TB was drug susceptible (DS-TB). Two protocols included patients diagnosed with either DS- or DR-TB, although in each case those with DR-TB were enrolled as a non-randomized interventional cohort that was not "statistically analyzed." Five (17%) of the trials included participants enrolled in African sites, eight (26%) included participants enrolled in Asian sites, and 16 (52%) included participants enrolled on both continents. Seven (24%) included subjects in South American sites, two (7%) in Latin America and five (17%) in North America. Proposed subjects were as young as 12 (one trial), 14 (two trials) and 15 (four trials), although one trial did not impose a lower age limit; however, most trials included patients aged 18 years and older. Two protocols capped the age of participants at 60, five at 65, one at 70, and another at age 75; in the remaining trials, an upper age limit was not specified. Only one trial conducted exclusively in children and adolescents was among the 51 trials selected for protocol and SAP review, but its protocol was not made available for inclusion in our review.
In all but one study, the primary objective was to investigate whether a novel treatment regimen had non-inferior or superior efficacy in terms of a "long-term durable cure extending through post-treatment follow-up." In the remaining study, efficacy outcomes were secondary to safety outcomes.
Novel interventions varied across trials, and included shortening treatment, evaluating the efficacy of new combination regimens, utilizing oral medications exclusively, testing different doses and durations of treatment, testing fixed-dose combination formulations, and simplifying treatment by utilizing intermittent dosing. A non-inferiority analysis comparing a new treatment regimen to standard treatment was specified in 18 (58%) protocols, with margins of non-inferiority ranging from 4% to 12%. Other techniques used included equivalence testing (n=6), superiority testing (n=3) and logistic regression to compare differences in proportions of participants achieving a Favorable outcome (or, conversely, an Unfavorable outcome). In 15 (48%) protocols, an intention-to-treat (ITT) or modified intention-to-treat (mITT) analysis was defined as primary, while per protocol (PP) analyses were also planned as secondary or confirmatory analyses. In 14 (48%) studies, the mITT and PP analyses were considered co-primary. In only one of the protocols we reviewed was the PP analysis considered primary; in one other, no specification was made (although in this case we did not have access to the trial SAP).
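To make the role of the non-inferiority margin concrete, the sketch below compares the proportion of Favorable outcomes in two arms and checks whether the lower confidence bound for the difference lies above the negative margin. It is a generic illustration under stated assumptions, not the method of any reviewed protocol: the counts are hypothetical, the unadjusted Wald interval and the one-sided alpha of 0.025 are assumptions, and the two margins are simply the extremes of the 4% to 12% range reported above.

```python
from math import sqrt
from statistics import NormalDist


def noninferior(fav_exp: int, n_exp: int, fav_ctl: int, n_ctl: int,
                margin: float, alpha: float = 0.025):
    """Return (difference, lower bound, verdict) for experimental minus control
    Favorable proportions, using an unadjusted Wald interval (an assumption)."""
    p_exp, p_ctl = fav_exp / n_exp, fav_ctl / n_ctl
    diff = p_exp - p_ctl
    se = sqrt(p_exp * (1 - p_exp) / n_exp + p_ctl * (1 - p_ctl) / n_ctl)
    lower = diff - NormalDist().inv_cdf(1 - alpha) * se
    # Non-inferiority is concluded when the lower bound exceeds -margin.
    return diff, lower, lower > -margin


# Hypothetical counts: 180/200 Favorable on the new regimen, 184/200 on control.
for margin in (0.04, 0.12):
    diff, lower, verdict = noninferior(180, 200, 184, 200, margin)
    print(f"margin {margin:.2f}: diff {diff:+.3f}, lower bound {lower:+.3f}, "
          f"non-inferior: {verdict}")
```

With the same hypothetical data, the conclusion differs between the narrowest and widest margins, which is one reason the choice of margin matters when comparing results across trials.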
The duration of experimental treatment regimens ranged from 13 weeks to 26 weeks for DS-TB trials, and from 24 to 44 weeks for DR-TB trials. Duration of post-treatment follow-up varied and was measured either from randomization or from the end of treatment, sometimes in weeks and sometimes in months. Some protocols specified "time windows" around evaluation dates, while others cited only the week or month representing the end of follow-up without explanation as to how much time before or after defined the follow-up "window." Total trial duration from randomization to end of follow-up ranged from 78 to 130 weeks for DS-TB trials, and from 104 to 132 weeks for DR-TB trials. In general, the primary trial outcomes were measured at the end of follow-up. At the time of writing, 8 (26%) trials were still open to enrollment. Seven trials (19%) were complete with study findings not yet available or in follow-up, and 2 (6%) were completed and had results posted on clinicaltrials.gov. For 14 (45%) trials, the primary results of the trial had been published in a peer-reviewed journal or presented at an international conference.

Outcome Definitions
Outcomes across study protocols were assigned to one of three broad categories: Favorable, Unfavorable, or Not Assessable. Protocols generally defined an outcome as Favorable in terms of timing of culture conversion and required number of negative cultures at the end of the follow-up period.
Similarly, determination of an outcome as Unfavorable primarily involved the observation of a specific number of positive cultures with or without reference to a time frame for the samples. All protocols specified these bacteriological conditions to some degree, although the circumstances under which determinations were made, and the granularity with which these were defined in individual protocols, varied considerably (see Supplemental Table 1 for a listing of outcome definitions found in protocols).
Protocols from recent studies were more likely to allow for categorization of an outcome as Not Assessable if it could not be clearly classified as Favorable or Unfavorable, e.g., deaths unrelated to TB, recurrence due to re-infection with a different strain, and loss to follow-up with last culture negative. However, identical outcome-determining events were categorized as Not Assessable in some protocols and Unfavorable in others. Protocols from earlier trials seldom specifically labeled an outcome Not Assessable, although this designation sometimes could be inferred from descriptions of patients excluded from analyses. In others, however, this possibility was neither explicitly nor implicitly addressed. Outcomes determined to be Not Assessable will be discussed simultaneously with Unfavorable outcomes, since the same event could be interpreted as one or the other by different trial protocols. Table 2 summarizes the range of outcome definitions and the frequency of their occurrence across protocols.
Protocols additionally addressed issues around treatment and adherence with respect to categorization of outcome. These will be considered last, as they often coincide with or contribute to other reasons for categorizing outcomes as either Unfavorable or Not Assessable.

Favorable Outcomes
In contrast to Unfavorable and Not Assessable outcomes, Favorable outcomes received the most consistent treatment across protocols. In all protocols that we reviewed, a patient with a Favorable outcome was defined as one who tested negative on a varying number of cultures, with reference to the end of treatment and/or follow-up. Nonetheless, this seemingly straightforward outcome underwent a multitude of permutations across trials. Some trials required only that a patient be "culture negative;" others defined an outcome as Favorable based on a single negative culture. The majority of trials required at least two negative cultures, and in a small number of trials, a patient was required to have three negative cultures to achieve negative status. In addition to the variability in number of negative cultures required, Favorable status was conditional on a variety of restrictions in terms of timing (with reference to either the end of treatment, the end of follow-up or both), spacing (amount of time between the negative cultures that ranged from occurrence on different days to requiring at least four intervening weeks between negative cultures), and culture medium type (solid or liquid).
Spontaneous sputum production usually decreases or resolves during successful treatment and follow-up for TB, and most such patients are culture-negative for M.tb (16). A smaller number of studies addressed a patient's potential inability to produce sputum at various points in the trial as indicative of a Favorable outcome. One protocol interpreted a patient's inability to ever produce sputum as a Favorable outcome; another further stipulated that never producing sputum would be considered Favorable even if the patient never achieved culture negative status but completed follow-up without clinical or microbiological relapse. Others defined circumstances under which failure to produce sputum at the end of the follow-up period could be classified as negative, e.g., provided this coincided with a patient having prior culture negative status or lacking clinical symptoms. In only one trial was failure to produce sputum at the end of follow-up categorized as an Unfavorable outcome. While generally classified as a Not Assessable outcome (see below), one study classified patients who developed an infection with a strain different from that with which they had originally been infected (an exogenous reinfection) as having a Favorable outcome if the original strain was shown to have been cured. In another, a contaminated culture result or one which could not be evaluated was categorized as Favorable, provided there were no positive cultures at the end of follow-up. Two studies allowed for a patient to have had a Favorable outcome even with a culture at the end of follow-up that was inconclusive, if clinical and radiological symptoms were supportive of the assessment.

Unfavorable Outcomes vs. Not Assessable Outcomes
In the broadest sense, we found that all the reviewed protocols deemed that a patient's outcome would be considered Unfavorable primarily based on positive sputum cultures. However, the level of detail attached to culture positivity varied from the most general ("Failure at end of treatment") to the bewilderingly complex: in one trial, for example, the outcome of a patient not attending the final visit could not be categorized as Unfavorable until all of four specified conditions were met, and two additional conditions had been taken into account.

Categories of unfavorable/not assessable outcomes
Failure to ever achieve negative culture conversion. A patient's failure to respond successfully, as defined bacteriologically, to the prescribed regimen by the end of the treatment period constituted the most straightforward type of unfavorable outcome. In some protocols, however, the treatment duration could be extended if necessary or if some limited number of treatments had been missed, thus lengthening the time a patient was given to achieve culture conversion or culture negative status.
Relapse and Re-infection. Recurrence of bacterial infection can occur as an endogenous relapse, defined as a patient's recurrence with positive culture status with the originally diagnosed strain, having previously attained negative status, or as an exogenous re-infection, i.e., a new infection with a different strain. Not all protocols specifically addressed an analytical approach to both. One protocol did not address either relapse or re-infection; some categorized the status of relapse but not re-infection, and several addressed re-infection but not relapse. Other protocols addressed and categorized both.
Relapse. In all studies, relapse was considered an Unfavorable outcome in terms of its analytical treatment. Although some studies provided specific definitions of relapse, others included it as either part of a composite outcome or (in a few cases where patients were required to have been previously treated and cured prior to the study) as the primary outcome. Definitions, when provided, varied as to when and how relapse was defined, and with what level of detail. Some studies defined a relapse as occurring in patients who were culture-negative at the end of treatment, but with different constraints on the conversion to culture-positive. These included diagnosing relapse in a patient who tested positive twice with no intervening negatives, whose two positive tests occurred at least one day apart, who had positive sputum cultures during four consecutive monthly exams (at least one with 20 or more colonies), or who had a subsequent diagnosis and treatment for the same or another DR strain (in a study targeting DR infections). Similarly, two additional DR studies defined relapse as having occurred when a patient was prescribed a new DR regimen after treatment and before the end of follow-up. Another study specified that a patient's conversion to negative status had to occur over at least four weeks, with subsequent positive status (on solid medium) confirmed by a second positive culture on a different day. Other studies offered less specific criteria, including simply "recurrence by the end of the study," "after cure, single culture positive," and "one culture positive and clinical features suggestive of recurrent disease."
Re-infection. Unlike relapse, patients who acquired an infection with a different type of TB were regarded by most DR-TB and DS-TB studies as having outcomes that were Not Assessable. Only one protocol viewed re-infection with a different strain as Unfavorable; one study targeting patients with DR-TB categorized re-infection with a different DR strain as Unfavorable, but with a DS strain as Not Assessable. As previously mentioned, one protocol categorized a patient's re-infection as Favorable if occurring after a confirmed conversion to negative status with respect to the original strain.
Death. With varying degrees of granularity, most protocols addressed death as an outcome, whether occurring during treatment, after treatment during the follow-up period, or during either. One protocol did not mention death in relation to outcome, and another mentioned death only in that it precluded a Favorable outcome; we were unable to obtain SAPs for either of these studies. The death of a patient was generally categorized as an Unfavorable outcome, although under certain specified circumstances, deaths could also result in study outcomes being considered Not Assessable.
Death during treatment. A patient's death during treatment could fall into one of the following categories: (1) death due to any cause, (2) death directly related to TB, and (3) death due to causes unrelated to TB. Non-TB deaths were categorized differently across studies; some considered these to be Not Assessable, while more frequently, studies treated them as Unfavorable, with exceptions for deaths due to accident, violence, trauma or suicide. Deaths due to accident, violence or trauma were generally classified as Not Assessable. Death by suicide was specifically addressed in a third of the protocols, but was considered Unfavorable by some and Not Assessable by others. An additional protocol specified that the outcome of a patient whose death during treatment was unrelated to TB, but whose culture status at the time of death was unknown, would be classified as Not Assessable.
Death during post-treatment follow-up. During the post-treatment follow-up phase, "all cause deaths" (without further differentiation) were regarded as Unfavorable outcomes in some studies, while in others, deaths were only considered Unfavorable if TB-related. A small number of studies considered a generalized category of non-TB deaths to be Not Assessable for purposes of analysis. In several studies, the treatment of death during follow-up was determined with respect to bacteriological status. Several trial protocols classified the outcomes of patients who died with their last culture negative as Not Assessable. Additional studies more specifically proposed that deaths be considered Not Assessable only if a patient died while culture negative, under the condition that the last positive culture had been followed by two negative cultures at least seven days apart. In addition, one study specified that a patient who died from extrapulmonary TB would be considered as having an Unfavorable outcome; another classified patients whose deaths were due to an infection other than with the originally diagnosed strain to have outcomes that were Not Assessable.
Withdrawal of consent/loss to follow-up. Across study protocols, outcomes of patients who were lost to follow-up or who withdrew from the study appeared to be the most challenging to categorize. These patients were variously noted as having been lost to follow-up or withdrawn: (1) while still being treated; (2) at any point during follow-up; (3) after being cured at the end of treatment, during follow-up; or (4) "when last seen."
During the treatment phase. With respect to patients lost or withdrawn while treatment was still ongoing (without further caveats), a quarter of the protocols classified their outcomes as Unfavorable; one study alone categorized them as Not Assessable. Other protocols determined categorization based on the reason for the patient's withdrawal. In one protocol, patients who withdrew or were lost due to clinical reasons were considered to have an Unfavorable outcome. More frequently, patients who exited the study during the treatment phase were considered to have outcomes that were Not Assessable, including those whose withdrawal was either unrelated to TB or was due to protocol violation, pregnancy, or moving away and/or becoming untraceable at any point.
After treatment completion. In addressing patients who were lost to follow-up or who withdrew after completing treatment, Unfavorable outcomes could include those who exited the study under any circumstances (although one protocol classified such a patient as having an outcome that was Not Assessable); those whose last positive culture was not followed by at least two negative cultures ≥7 days apart; those who terminated the study early, but were known to be alive at last contact, or who were lost to follow-up with vital status unknown; patients who had not achieved culture negative status or who had been classified as having an Unfavorable outcome before their withdrawal; patients who could not be contacted for some specified period of time prior to the last study visit; and those who had no culture results within a specified window of time prior to the study endpoint. As specified by two protocols, it was also necessary for these latter patients to be either culture positive when last tested, have no other post-baseline results, or have a negative culture at their most recent result, but with radiological or clinical symptoms that were inconclusive.
Alternatively, the following patients were categorized with varying frequency as having outcomes that were Not Assessable: those whose last culture before study exit was negative; patients whose last two culture results prior to exit were negative, who had not otherwise been deemed Unfavorable; patients whose last culture was negative and whose last positive culture was followed by at least two negative cultures at different visits ≥7 days apart, without an intervening positive culture; and patients not otherwise classified as Unfavorable prior to exit from study.
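One of the more detailed rules above (last culture negative, and the last positive culture followed by at least two negative cultures ≥7 days apart with no intervening positive) can be written as a short check. The sketch below is a hypothetical encoding and not taken from any protocol: the result codes, the use of integer study days, the choice to ignore contaminated results when counting negatives, and the handling of patients with no positive culture (all cultures are then considered) are assumptions.

```python
from typing import List, Tuple

# (study day, result), where result is "pos", "neg" or "contaminated" (assumed codes)
Culture = Tuple[int, str]


def two_negatives_after_last_positive(cultures: List[Culture],
                                      min_gap_days: int = 7) -> bool:
    """True if the last culture is negative and the last positive culture is
    followed by at least two negatives taken at least min_gap_days apart."""
    ordered = sorted(cultures)
    if not ordered or ordered[-1][1] != "neg":
        return False  # the most recent culture must be negative
    # Index of the last positive culture (-1 if the patient was never positive).
    last_pos = max((i for i, (_, result) in enumerate(ordered) if result == "pos"),
                   default=-1)
    # Everything after the last positive is, by construction, free of positives.
    negative_days = [day for day, result in ordered[last_pos + 1:] if result == "neg"]
    return (len(negative_days) >= 2
            and negative_days[-1] - negative_days[0] >= min_gap_days)


# Hypothetical culture histories.
print(two_negatives_after_last_positive([(0, "pos"), (14, "neg"), (28, "neg")]))
print(two_negatives_after_last_positive([(0, "pos"), (14, "neg"), (16, "neg")]))
```

Writing the rule down this way exposes the choices a protocol would need to state explicitly, such as how contaminated results and window boundaries are handled.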
Patients who withdrew or were lost to follow-up after having been cured at the end of treatment were specifically addressed by one study; those who either did so with their most recent culture positive or who moved away with their most recent culture positive were considered to have Unfavorable outcomes, while those who under the same circumstances were culture negative or whose most recent culture was contaminated were categorized as Not Assessable.
In some protocols, outcomes were defined at the time when patients "were last seen." Detailed events included being culture positive, being culture positive with the same type (whether confirmed or not), culture positive not followed by two negatives, or simply not having achieved or maintained culture negative status at the time of their last visit (prior to study endpoint).

Treatment-Related Issues (including treatment changes for adverse events)
Most protocols addressed to some extent their analysis plans regarding treatment issues, including extension, restart, change, and discontinuation of the medications which comprised the specific study regimens. Although in most cases patients who experienced treatment disruptions were considered to have Unfavorable outcomes, details varied considerably from study to study. A patient whose treatment was extended for any reason was considered to have had an Unfavorable outcome by one study. More commonly, however, the outcomes of patients whose treatment was extended were considered Unfavorable but with exceptions that were considered Not Assessable, including: temporary drug rechallenge, over-treatment with assigned drugs, ≤21 days of non-study anti-TB medication for active TB, secondary isoniazid preventive therapy in HIV+ patients, re-infection, pregnancy, making up missed doses, and remaining on treatment at the end of the study without having been declared a treatment failure.
Some protocols categorized patients whose treatment had to be restarted as experiencing an Unfavorable outcome, again with the exceptions that they either had been infected with a different TB type in some cases or had become pregnant in others; another protocol limited designation of an Unfavorable outcome to the period after completion of treatment but before the study's end.
A change in treatment can take many forms, and this was reflected across protocols. A patient who had any change of medication frequency or dose (except in the case of re-infection) was usually considered as having an Unfavorable outcome, although two protocols made exceptions for patients with a single drug replacement, or those whose drug replacement was due to a guideline change in the standard of care group (neither affected outcome classification). Patients whose treatment was changed due to clinical or radiological deterioration, or because of non-response or poor adherence, were considered to have an Unfavorable outcome by one study each, respectively. Several studies considered as Unfavorable outcomes those of patients for whom one drug was replaced or added, while other studies required that a patient's treatment involve the replacement or addition of at least two drugs. Such categorization based on number of drug changes ranged from the simple to the overly complex: one study placed further conditions on a two-drug change, declaring that it defined a patient's outcome as Unfavorable if this occurred because the patient (1) had not converted by the end of the first (more intense) phase of treatment, (2) had bacteriologically reverted during the second treatment phase after having converted to negative in the first, (3) had evidence of additional acquired resistance to fluoroquinolones or second-line injectables, or (4) had not converted their sputum cultures to negative status and had two positive cultures during a specific time period, with the caveat that if one or more of the samples were unavailable or contaminated this would be considered culture positive if the patient displayed deteriorating clinical symptoms.
A patient whose treatment was discontinued was considered by various protocols as having an Unfavorable outcome if study treatment was halted for reasons including the following: experiencing a serious adverse event; starting a different DR-TB regimen; failing to convert after the first phase of a trial where the treatment regimen occurred in two phases; or because the trial regimen needed to be significantly modified for some (unspecified) reason. In most trials, study treatment was discontinued in patients who became pregnant during therapy, who were then treated with standard therapy. In some trials, patients who discontinued treatment because they became pregnant were considered to have an outcome that was Not Assessable, while in others a patient's outcome was considered Not Assessable if the patient's last culture was negative, but Unfavorable if it was positive.
Incomplete treatment in patients whose culture status could not be evaluated at the end of follow-up was considered Unfavorable in several studies; an additional protocol defined a patient's outcome as Unfavorable if, in addition to incomplete treatment, a patient had not attained culture negative status by the end of follow-up. The effect of a patient's missing drugs during the treatment phase was addressed by one protocol that considered this to be Unfavorable if some or all drugs were missed regularly, or if all drugs were missed for more than two consecutive weeks.
Patients who took TB-related but off-protocol drugs, or who started TB treatment outside of the study with the most recent culture positive, were considered by one study to have an Unfavorable outcome, while off-protocol drugs not related to TB rendered the outcome Not Assessable. In two other studies, only patients taking specific off-protocol drugs were categorized as having Unfavorable outcomes.
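As an illustration of how the missed-dose criterion might be operationalized, the sketch below flags a patient who missed all drugs for more than two consecutive weeks. It assumes that "two consecutive weeks" means 14 consecutive scheduled treatment days and that adherence is recorded as one boolean per scheduled day; both are assumptions made here, not details given in the protocol. The alternative condition ("missed regularly") is not quantified in the source and is deliberately left out.

```python
def missed_all_drugs_over_two_weeks(daily_all_missed, max_consecutive_days=14):
    """Return True if all drugs were missed for more than
    max_consecutive_days scheduled days in a row.

    daily_all_missed: iterable of booleans, one per scheduled treatment
    day, True when every drug was missed that day (hypothetical layout).
    """
    run = 0  # length of the current run of fully missed days
    for missed in daily_all_missed:
        run = run + 1 if missed else 0
        if run > max_consecutive_days:
            return True
    return False
```

Under these assumptions, exactly 14 consecutive missed days would not trigger the rule while 15 would, which is the kind of boundary a protocol or SAP would need to specify.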

Discussion
In our review of primary efficacy outcomes as defined in the protocols (and SAPs, if available) of 31 confirmatory clinical trials for the treatment of active TB, we found broad conceptual agreement. A patient's outcome was classified as Favorable or Unfavorable based on the number and timing of negative/positive cultures, and most protocols explicitly acknowledged that outcomes were Not Assessable under certain circumstances (in other cases, this was implicit in descriptions of inclusions and exclusions from given analyses). However, even though the trials complied with the decidedly broad guidelines for trial sponsors provided by stringent regulatory authorities (17, 18), we found a considerable degree of heterogeneity in outcome definition across trials. In addition, outcomes were for the most part composed of composite events, and inconsistencies abounded with respect to the ways in which outcome definition was determined by such factors as deviation from treatment regimens, patient withdrawal/loss to follow-up, relapse or reinfection, and even death; the contributions of the individual components to the composite outcome were in all cases unweighted. These outcomes then dictated inclusions and exclusions from different target populations for analysis (ITT, mITT, and per protocol [PP]), which in turn were variously considered to be of primary, secondary, or equal importance; sensitivity analyses were also used in some trials, in varying ways.
Nonetheless, we found that certain areas were treated consistently across protocols, indicating implicit areas of consensus that would facilitate standardization of endpoint definitions. The fairly straightforward criteria for determining a Favorable outcome allowed this classification to be more easily reached, as compared to an Unfavorable or Not Assessable one. While details differed in terms of the number and timing of cultures indicating conversion, and the duration required to validly declare it a durable, long-term cure, it is probable that a consensus definition of a patient with a Favorable outcome would neither be difficult to reach, nor particularly controversial, across investigators and trials. Similarly, a patient who suffered a relapse with the originally diagnosed infecting strain of M.tb after reaching confirmed culture negative status was universally classified as having had an Unfavorable outcome, although here, too, small variations in definition would need to be reconciled to standardize this event across trials.
Understandably, standardizing Unfavorable outcomes presents a far greater challenge. In a systematic review of outcomes reported in 248 peer-reviewed and published phase III TB studies, Bonnett et al. found substantial differences in the way Unfavorable outcomes were defined and implemented across numerous dimensions (19). That review was limited to data derived from trial publications and included TB trials from 1950 to 2017 (only 18% of which occurred after 1995), yet Bonnett et al. reported inconsistencies as to what constituted an Unfavorable outcome even in the most recent trials. In our review, with the granular data obtained from the necessarily more detailed study protocols and SAPs (all of which had been registered since 1995), we likewise found little consensus in the specific details attached to endpoint definitions for Unfavorable.
As a result, combining data, interpreting and comparing results, and performing individual patient data (IPD) meta-analyses across trials that essentially are all working towards the same goal is at best highly challenging (as experienced with the largest such analysis of TB clinical trial data (4)) and at worst, impossible. The concept of a "Favorable" outcome can be more difficult to define for patients who fail to produce sputum, do not complete the trial, or require treatment changes. Distinguishing between Unfavorable and Not Assessable outcomes, as we have shown, presents even greater potential for discordance, and more differing opinions about what event constitutes each. Adding to the confusion is the fact that some patients inevitably exit the study prematurely, and their reasons for not completing the study, along with their culture status at the time of their exit, are inconsistently used to classify their outcomes as Unfavorable or Not Assessable. Even death, an undeniable and immutable outcome, is cause for dissension; while all protocols considered a patient who died from TB to have had an Unfavorable outcome, deaths that were not related to TB could be interpreted as Unfavorable or Not Assessable, depending on circumstances. The conventional use of composite outcomes, which may or may not be directly related to treatment, and that involve numerous assumptions about each "piece" of the outcome, further clouds evaluations of efficacy.
The ICH E9 (R1) addendum, in providing a framework and language for defining clinical trial estimands and outcomes, directly addresses these problems. While no guidelines can cover all circumstances that may arise in a trial, and it is unlikely that one estimand will satisfy the interests of all categories of trial stakeholders, interpretation and comparison across trials would be greatly facilitated by a standardization of the elements of a trial used to evaluate the efficacy of its intervention. In our review of protocols, we found a wide range of granularity of definitions. While some protocols defined outcomes in the most general terms, others complicated definitions by attempting to cover every possible eventuality; in the latter case, the outcome definitions were clouded with minutiae, making consensus with other trials extremely unlikely. On the other hand, the absence of precise definitions of outcomes in the protocol or SAP means that some classification decisions are left to the data analyst; these may have only been "documented" in the analysis code, which is rarely reviewed by study investigators. The ICH E9 (R1) addendum has taken an instructive approach in addressing these problems, separating from the definition of the primary outcome, called an estimand, the many events that occur and either preclude or affect observation of an outcome (referred to by the ICH E9 as "intercurrent events"). This not only allows for a consistent definition of the primary efficacy outcome across trials but also gives a structure for specifying how intercurrent events will be handled in the analysis, thus reducing potential inflation of Type I and Type II errors. Events that have until now been viewed as rendering an outcome "Not Assessable" can rather be categorized as intercurrent events, with decisions about how they will be dealt with in analyses made prior to the beginning of the trial, based on the defined estimands.
Thus, within the same trial, an intercurrent event may be treated one way for one estimand, and another way for a second estimand, dependent on the needs of particular stakeholders.
Bringing such events to the forefront therefore would allow for standardization in the way they are classified and how they are treated in the analysis, resulting in the reporting of outcomes that are comparable across trials. Even if standardization is not possible between different trials, at the very least the ICH E9 (R1) addendum provides a lingua franca for specification to support clear interpretation and translation into clinical practice guidelines. Some preliminary work has been done in this area in the context of individual trials (20, 21).
Our study has several limitations. We were not able to obtain an SAP for every trial we reviewed, and in these cases lacked the descriptions of how events would be dealt with in analyses, which are often more detailed in the SAP than in the protocol. The sheer length of the protocols themselves made locating specific pieces of information difficult (and in some cases, particularly in the case of older studies, it was not present or was treated in general terms, with specifics purportedly left to the SAP). While our search for clinical trials was as thorough as possible, we cannot be sure that all recent clinical trials were included.
We did not receive responses from investigators for 17 of the 51 trials selected for protocol review. These were necessarily excluded, although many may not have been within the scope of our review (notably, some of the country-specific trial registries had limited data to assess whether trials were within scope).
We did not distinguish trials of unlicensed drugs, conducted under the oversight of stringent regulatory authorities and often sponsored by the pharmaceutical industry, from investigator-initiated trials of licensed drugs. Both are designed to inform policy and practice, and better specification and standardization of endpoint definitions is relevant to all future TB treatment trials.
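The idea that a single intercurrent event can be handled differently under two estimands within the same trial can be sketched as follows. The five strategy names follow the ICH E9 (R1) addendum; the estimand labels, the event name `treatment_extension`, and the handling rules are hypothetical and chosen only for illustration. Only two strategies are implemented, since the others require modelling decisions that go beyond a simple classification rule.

```python
from enum import Enum

class Strategy(Enum):
    """Strategies for intercurrent events named in ICH E9 (R1)."""
    TREATMENT_POLICY = "treatment policy"
    COMPOSITE = "composite variable"
    HYPOTHETICAL = "hypothetical"
    WHILE_ON_TREATMENT = "while on treatment"
    PRINCIPAL_STRATUM = "principal stratum"

# Hypothetical pre-specified plan: the same event is handled differently
# depending on the estimand of interest.
ESTIMAND_PLAN = {
    "estimand_A": {"treatment_extension": Strategy.COMPOSITE},
    "estimand_B": {"treatment_extension": Strategy.TREATMENT_POLICY},
}

def analysed_outcome(estimand, observed_outcome, intercurrent_events):
    """Return the outcome used in the analysis of a given estimand."""
    for event in intercurrent_events:
        strategy = ESTIMAND_PLAN[estimand].get(event)
        if strategy is Strategy.COMPOSITE:
            # The event itself is counted as part of the outcome.
            return "Unfavorable"
        if strategy is Strategy.TREATMENT_POLICY:
            # The event is ignored; the observed outcome is used as is.
            continue
        raise NotImplementedError(f"No rule sketched for {event!r} under {strategy}")
    return observed_outcome
```

For example, a patient with an observed Favorable outcome and a treatment extension would be analysed as Unfavorable under `estimand_A` and as Favorable under `estimand_B`, with both decisions fixed before the trial begins rather than left to the analysis code.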
While no estimand can include all possible trial occurrences, the standardization of definitions and of the treatment of intercurrent events that occur most frequently will enhance comparability across trials, while allowing for interpretation of rare or unanticipated events. The approach outlined by the ICH E9 addendum can also be used to develop different estimands to address the concerns of specific audiences. It is therefore important, following the recommendations of the ICH E9 addendum, to prioritize both the specification and standardization of outcomes across TB trials. As new drugs and treatment regimens are discovered and tested in trials, the ability to make valid comparisons to old treatments and regimens is also essential if researchers are to collaborate effectively towards our common goals of developing shorter, simpler, and more effective and safe treatment to cure patients with TB. Following this review, our next step will be to produce recommendations for estimands and methods of estimation for TB treatment trials. At a time when the world has begun to establish large adaptive platforms with core protocols in the search for active treatments for COVID-19, it is past time that we, as a TB community, move towards better standardization and harmonization of trial methods.

Authors' contributions: PP was responsible for conception and design of the work. PP and JL substantially contributed to identification of studies and review for eligibility and inclusion. PP was responsible for obtaining protocols and SAPs from study investigators. All authors contributed to the interpretation of the data and revised the work. All authors have approved the submitted version. All authors have agreed both to be personally accountable for the authors' own contributions and to ensure that questions related to the accuracy or integrity of any part of the work, even ones in which the author was not personally involved, are appropriately investigated, resolved, and the resolution documented in the literature.

Tables
Tables 1-2 and S1 are available in the Supplementary Files.

Figure 1