Variation in the Accuracy of Endpoint Selection in Clinical Studies for Rare Diseases in Respect of Increasing Knowledge on Disease-Severity Measurement


Background: Previous research assessed the accuracy of disease-severity measurement in clinical studies as a mathematical relationship between the set of endpoints selected and the disease-severity scale (DSS), a surrogate for the theoretical Neutral list of indicators representing the disease phenotype. New DSSs are continually developed, so clinical studies’ operationalisation of the Neutral list and resulting relative neutrality may vary over time. We assessed variation in the neutrality of clinical studies over time and the probability of false positive and false negative classifications at different disease prevalence rates.
Methods: We used search strings extracted from the Orphanet Register of Rare Diseases using a proprietary algorithm to conduct a systematic review of studies published until January 2021 per Preferred Reporting Items for Systematic Reviews and Meta-Analysis guidelines. Overall, 483 studies and 12 rare diseases met inclusion criteria. We extracted all indicators from clinical studies and calculated neutrality and its components, sensitivity and specificity, as well as the probability of misclassifications at 20%, 50% and 80% disease prevalence rates at two time points, the times of publication of the first and last DSS. Surrogate Neutral lists were the first DSS and a composite of all later DSSs.
Results: Over time, the neutrality of clinical studies increased for six diseases and decreased for five diseases, driven by sensitivity for all but Friedreich ataxia. The neutrality of clinical studies in encephalitis decreased, but sensitivity remained constant at zero. At both timepoints, the likely false negative rate increased and the likely false positive rate decreased with increasing disease prevalence. The probability that the least neutral clinical study for most diseases would yield a false positive result was equal to one at all disease prevalence rates.
Conclusions: The potential for accurate clinical trial disease-severity measurement increases over time. Neutral theory showed that endpoint selection and DSSs may need improvement in Charcot Marie Tooth disease, Gaucher disease Type I, Huntington’s disease, Sjogren’s syndrome and Tourette syndrome. Using Neutral theory to benchmark disease-severity measurement in rare disease clinical trials may reduce the risk of misclassification, ensuring that recruitment and treatment effect assessment optimise medicine adoption and benefit patients.


Background
Ensuring the accuracy of clinical trial endpoints in rare diseases is important not only because trial results inform treatment decisions but also because disease progression for such conditions may remain poorly understood despite growing research attention. 1,2 Further, clinical trials for rare diseases are often small and have low power by default, so optimising measurement accuracy may ensure that endpoints translate to clinically meaningful results. 1 Clinical trials use distinct sets of indicators to observe disease severity, so the accuracy of disease severity measurement is affected by inter-trial variation, 3 which could be beneficial to address. This problem with inter-trial variation in accuracy has been quantified using Neutral theory, which showed that when such sets of indicators are assessed against an aspirational list of all possible indicators of disease severity, one observer is likely to be more accurate than another. 4 This may result in variation in the number of patients classed as meeting a threshold of severity between clinical studies. 3,4 Here, neutrality describes how accurately a set of clinical trial endpoints reflects the ideal construct for the disease being observed, which can be theoretically represented as an exhaustive list of disease severity indicators (the "Neutral list"). 5 Neutrality can be expressed as the sum of a measure's sensitivity (the proportion of relevant indicators included) and specificity (the proportion of irrelevant indicators excluded), with a maximum score of two. Currently, the neutrality of disease severity measurement in rare disease clinical trial settings may range from zero to two, suggesting a need for improvement in both the sensitivity and specificity of disease severity measures used in clinical trials. This is especially salient in rare diseases, in which patients commonly show variation in disease phenotype 6 or may present atypically with severe disease in the absence of usual severity indicators. 7
While disease severity measures may correlate, 8 one study showed that as many as 20% of trial patients could have been excluded due to inter-trial variation in disease-severity measurement. 3 Previous work on the neutrality of disease severity measurement has limited its scope to diseases with a single disease-severity scale (DSS) as a surrogate for the Neutral list for a disease. In this paper, we quantified the neutrality of disease-severity measurement in rare diseases with more than one DSS. This was necessary because severity measures are continually updated over time through scientific research, 9 so diseases have various DSS measures published at different times. In practice, if the neutrality of disease measurement in clinical trials varies over time, then the number of patients misclassified also varies at different time points. This has implications for the reliability of trial results and the decisions they inform. At a given time, the number of patients misclassified depends on the neutrality of the list of disease severity indicators used in clinical trials as a surrogate for the Neutral list. 5 Neutrality is also disease-specific, so while the Neutral list is a theoretical ideal that is unattainable in practice, it can be used, in principle, to guide research streams to identify sets of indicators that are the best match for the Neutral list of a given disease at any given time. If the neutrality of disease measures changes over time as the body of knowledge increases, then the best match for the Neutral list must be continually researched and updated as new evidence on disease-severity measurement becomes available.
To test this, we measured the neutrality of clinical studies at two time points, the time of publication of the first DSS and the time of publication of the most recent DSS, with the Neutral list surrogate at the second time point being a composite of all DSSs published up until that time. We assumed that the composite would reflect a closer match to the Neutral list, as it was based on a greater body of research, and new indicators of disease severity are explored continually as research advances. 11 The sample of clinical studies was held fixed while the surrogate changed from the first to the composite DSS, so we expected the neutrality of clinical studies measured against them to decrease and the degree of patient misclassification to increase between the times of publication of the first and last DSS. In other words, we hypothesised that the neutrality of clinical studies would vary as a function of the body of research conducted on the disease.
Overall, this work focuses on the influence of time, as a function of the wider body of knowledge generated between two time points, on the neutrality of disease measurement and patient misclassification in clinical trials for rare diseases.

Aim
The aim was to assess variation in the neutrality of clinical studies over time and the probability of false positive and false negative classifications at different disease prevalence rates.

Design
A systematic review and statistical analysis were conducted.

Identification of clinical studies and rare diseases with more than one DSS
A previous study identified 26 rare diseases with at least one associated DSS by systematic review with search strings generated from all diseases listed on the Orphanet Register of Rare Diseases using a proprietary algorithm. 4 Of these, 15 were identified as having more than one DSS. A systematic review was conducted according to the Preferred Reporting Items for Systematic Reviews and Meta-Analysis (PRISMA) guidelines on the MEDLINE database, including all studies published until 21 January 2021.
For each disease, the review used generic search strings as well as specific search strings generated from the Orphanet Register of Rare Diseases. All search strings have been provided in the appendix. To be included, studies had to be randomised controlled trials or observational clinical studies, be conducted with human participants, have clearly recorded outcomes/endpoints, have full text available and be written in the English language. Animal and in-vitro studies were excluded. For a disease to be included in the study, there had to be at least two associated DSSs and more than five studies in the published literature that used each of its associated DSSs to measure the severity of the disease. The studies also had to state clearly the disease studied. Two research analysts conducted the reviews, and the initial screening of titles and abstracts was conducted using Rayyan, 12 a web-based tool that automates systematic review processes.
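The disease-level inclusion rule above can be written as a small check. The following is only a sketch: the function name, the constants and the example counts are hypothetical, and it assumes the rule is read as "every associated DSS must have more than five studies using it".

```python
from typing import Dict

# Thresholds taken from the text; the constant names are illustrative only.
MIN_DSS_COUNT = 2          # "at least two associated DSSs"
MIN_STUDIES_PER_DSS = 6    # "more than five studies" using each DSS


def disease_is_eligible(studies_per_dss: Dict[str, int]) -> bool:
    """Return True if a disease meets the disease-level inclusion rule.

    studies_per_dss maps a DSS name to the number of published studies
    that used it (hypothetical input format).
    """
    has_enough_scales = len(studies_per_dss) >= MIN_DSS_COUNT
    each_scale_used = all(
        count >= MIN_STUDIES_PER_DSS for count in studies_per_dss.values()
    )
    return has_enough_scales and each_scale_used


# Made-up counts, not data from the review.
print(disease_is_eligible({"scale_A": 12, "scale_B": 7}))  # both scales pass
print(disease_is_eligible({"scale_A": 12, "scale_B": 3}))  # one scale too sparse
```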

Changes in the neutrality of clinical study disease measurement over time
The neutrality of clinical study disease measurement was assessed by first extracting all indicators reported in included clinical studies for each disease. Neutrality was defined as the sum of the sensitivity and specificity of these sets of endpoints, where sensitivity was the proportion of relevant indicators included and specificity was the proportion of irrelevant indicators excluded as compared to the Neutral list. 5 In this study, we used the DSS as a surrogate for the Neutral list. As we expected that the Neutral list for each disease would vary over time as a function of the amount of research completed, we measured the neutrality of sets of clinical endpoints at two time points, the times of publication of the first and last DSS. Our surrogate Neutral list at the second time point was a composite of all indicators in all DSSs published at that time that met the inclusion criteria. We excluded duplicate indicators from the composites for each disease. We calculated the mean neutrality, sensitivity and specificity of measures used in clinical studies for each disease against the surrogate Neutral lists at each time point. Then, we calculated the neutrality, sensitivity and specificity of the most and least neutral clinical studies and computed the rates of false positive and false negative classifications these measures were likely to lead to at different disease prevalence rates according to previously described methods. 5 We observed neutrality at 20%, 50%, and 80% disease prevalence rates, representing clinical trial, observational study and clinical/outpatient settings, respectively. We treated sensitivity and specificity as statistically independent and included indicators from the composite DSS as part of the total information observed in the analysis for the first DSS (Table 1).
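To make the definitions above concrete, the following minimal sketch scores one study's indicator set against a first DSS and against a composite DSS. It is an illustration under stated assumptions rather than the procedure used in the study: the function names, the single-letter indicator labels and the choice of "universe" (all indicators seen in the study or in any DSS) are hypothetical.

```python
from typing import Dict, Set


def composite_dss(*scales: Set[str]) -> Set[str]:
    """Merge the indicators of several DSSs; set union removes duplicates."""
    merged: Set[str] = set()
    for scale in scales:
        merged |= scale
    return merged


def neutrality(study: Set[str], neutral_list: Set[str],
               universe: Set[str]) -> Dict[str, float]:
    """Score a study's indicators against a surrogate Neutral list.

    Assumption: 'relevant' indicators are those on the surrogate list and
    'irrelevant' indicators are everything else in the observed universe.
    """
    relevant = neutral_list & universe
    irrelevant = universe - neutral_list
    # Sensitivity: proportion of relevant indicators the study included.
    sensitivity = len(study & relevant) / len(relevant) if relevant else 0.0
    # Specificity: proportion of irrelevant indicators the study excluded.
    specificity = len(irrelevant - study) / len(irrelevant) if irrelevant else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "neutrality": sensitivity + specificity,  # maximum of two
    }


# Made-up indicator labels, not data from the review.
first_dss = {"a", "b"}
later_dss = {"b", "c", "d"}
composite = composite_dss(first_dss, later_dss)
study = {"b", "c", "x"}
universe = composite | study  # total information observed at both time points

print(neutrality(study, first_dss, universe))
print(neutrality(study, composite, universe))
```

With the same fixed study, the score can move in either direction when the reference list grows, depending on whether the newly added indicators appear in the study.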

Rare diseases, disease-severity scales and clinical studies included
Overall, 483 of the 2942 studies reviewed were included (Figure 1). Of the 26 diseases identified as having at least one associated DSS, 15 were identified as having more than one DSS. Gaucher disease Type 3 and Niemann Pick disease were identified in early screening as having more than one DSS but were eventually excluded due to insufficient published data on their disease-severity measures. Crohn's disease also had more than one DSS, but research using the DSS did not distinguish between the diseases measured, describing patients as having inflammatory bowel disease or a combination of both Crohn's and ulcerative colitis. Ulcerative colitis was excluded before review, as the previous study had identified it as having insufficient published data using its DSSs. The following 12 rare diseases were included (number of DSSs): acromegaly (3), amyotrophic lateral sclerosis (6), Charcot Marie Tooth disease (5), cystic fibrosis (5), encephalitis (4), Fabry disease (2), Friedreich ataxia (3), Gaucher disease Type 1 (2), Huntington's disease (2), juvenile rheumatoid arthritis (2), Sjogren's syndrome (2), and Tourette syndrome (3). An overview of the diseases, the composition and publication dates of their first and composite DSSs and the number of indicators in each has been provided in Table 2. Full details of indicators included in first and composite DSSs are available upon reasonable request. As shown in Table 1, the number of unique indicators available as the Neutral list for all diseases increased over time.

Changes in the neutrality, sensitivity and specificity of clinical studies over time
Table 3 shows the mean change in neutrality, sensitivity and specificity of clinical studies over time when compared against the first and composite DSS as well as the number of clinical studies included for each disease. We found two main effects.
First, for almost half the diseases (acromegaly, amyotrophic lateral sclerosis, cystic fibrosis, Fabry disease, and juvenile rheumatoid arthritis) the mean neutrality of clinical studies increased over time, and this appeared to be driven by an increase in sensitivity. The magnitude of increase in the mean neutrality of clinical studies for acromegaly (0.135), amyotrophic lateral sclerosis (0.094), Fabry disease (0.083), and juvenile rheumatoid arthritis (0.126) varied, with cystic fibrosis (0.022) showing the smallest increase. The sensitivity of clinical studies in most of these diseases increased threefold, but for juvenile rheumatoid arthritis, it was closer to a fivefold increase. By comparison, the increase in the mean specificity of clinical studies for each of these diseases was negligible. For Friedreich ataxia, there was an increase in the neutrality of clinical studies, but the increase in sensitivity was much more modest than in the other diseases and was of a similar magnitude to the increase in specificity.

Changes in potential false positive and false negative classifications arising from changes in the neutrality, sensitivity and specificity of clinical studies over time
Figures 2 and 3 show the potential false positive and false negative classifications arising from changes in the neutrality, sensitivity and specificity of clinical studies over time when assessed against the first and composite DSS, respectively. First, overall, for both first and composite DSSs, the probability of a false negative result increased with increasing disease prevalence, while the probability of a false positive decreased. Second, for both first and composite DSSs, the probability that the least neutral clinical study for most diseases would yield a false positive result was equal to one at all disease prevalence rates. However, for encephalitis, this was true for both the most and least neutral study. There were no instances of the probability of a false negative equalling one for either the first or composite DSS.
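The exact formula for the misclassification probabilities is deferred to previously described methods and is not restated in this paper. One reading that reproduces the qualitative pattern reported above is to apply Bayes' rule and take the probability that a 'severe' classification is wrong (1 − positive predictive value) and the probability that a 'not severe' classification is wrong (1 − negative predictive value). The sketch below works under that assumption only; the function name and the example sensitivity and specificity values are hypothetical.

```python
from typing import Tuple


def misclassification(sensitivity: float, specificity: float,
                      prevalence: float) -> Tuple[float, float]:
    """Return (P(false positive | positive), P(false negative | negative)).

    Assumed reading: these are 1 - PPV and 1 - NPV from Bayes' rule,
    which is not confirmed by the source.
    """
    true_pos = sensitivity * prevalence
    false_neg = (1 - sensitivity) * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    positives = true_pos + false_pos
    negatives = false_neg + true_neg
    # Undefined (nan) when no classification of that kind can occur.
    p_false_pos = false_pos / positives if positives > 0 else float("nan")
    p_false_neg = false_neg / negatives if negatives > 0 else float("nan")
    return p_false_pos, p_false_neg


# Prevalence rates used in the study; sensitivity/specificity are made up.
for prevalence in (0.2, 0.5, 0.8):
    fp, fn = misclassification(0.6, 0.7, prevalence)
    print(f"prevalence={prevalence:.0%}  P(FP)={fp:.3f}  P(FN)={fn:.3f}")
```

Under this reading, a measure with a sensitivity of zero has a false positive probability of one at every prevalence rate, because none of its positive classifications can be true positives.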

Discussion
There is a pressing need to improve the accuracy of clinical trial endpoints in rare diseases. 1 Previous research has addressed this by assessing the neutrality of clinical endpoints in rare diseases using the DSS as a surrogate for the Neutral list. 4 However, some diseases had more than one DSS, meaning that the neutrality of any study measured against them may vary over time as a function of the body of knowledge on a disease. We expected that the neutrality of the fixed sample of clinical studies would decrease over time, as the body of knowledge (operationalised here as the number of indicators) increased. The number of indicators in the surrogate Neutral list for all diseases increased over time, but we found that the neutrality of clinical studies for one subgroup of diseases increased over time, while it decreased for another subgroup. This suggested that the neutrality of clinical studies changed in different ways with respect to the body of knowledge.
In the first subgroup of diseases, the mean neutrality of clinical studies was higher when measured against the composite DSS than the first DSS, and this appeared to be driven by an increase in sensitivity. That is, as the number of indicators in the surrogate for the Neutral list increased over time as a function of growth in the body of knowledge, clinical studies included a greater proportion of its indicators. This was not merely a function of the higher number of indicators increasing the probability of a match between the composite DSS and clinical studies, because we did not find this pattern across all diseases. Rather, this suggested a convergence of knowledge whereby DSS indicators generated through scientific research over time also showed up in the group of clinical studies assessed. We assumed that the sample of clinical studies would be static in terms of growth in the body of knowledge; however, it contained research spanning many years, so the body of knowledge that informed the composite may also have informed a proportion of the clinical studies in the sample. In the study of scientific epistemology (for example, Peirce's convergence of truth and the mathematics upon which it is based), a convergence of knowledge on a construct observed alongside an increase in sample size can be taken as a sign of the validity of the knowledge generated. 13,14 The convergence of knowledge between DSSs and clinical studies in the first subgroup suggested that the disease phenotype operationalised by the indicators shared between them tended towards a more accurate representation of the theoretical Neutral list over time, producing more accurate measures of disease severity.
In the second subgroup of diseases, we observed a decrease in neutrality, which we had expected under the incorrect assumption that clinical studies would be unaffected by the increasing body of knowledge. Given our findings in the first subgroup, this can be better interpreted as a divergence of knowledge. If the clinical and DSS studies were methodologically sound, then this divergence may be fertile ground for hypothesis building and further knowledge generation. 15 Further research may examine qualitative differences between indicators in each disease, as our findings suggested that DSSs in the first subgroup were more likely to contain indicators that were specific, measurable and objective and that were pathophysiological as well as behavioural and psychological. As we measured clinical studies as a homogenous group and did not separate them into two timepoints, our findings cannot show that the changes in neutrality found represented a specific relationship between neutrality and time within the diseases studied. However, our methods were sufficient to demonstrate that changes in how the Neutral list is operationalised over time affect the accuracy of clinical trial disease measurement and that this must be accounted for during the selection of endpoints.
The inaccurate measurement of disease severity in clinical trials may result in patient misclassification. 3,9,10 We measured the impact of neutrality via its components, sensitivity and specificity, on the probability of detecting false negative and false positive results at different disease prevalence rates. In a clinical trial setting (20% prevalence rate), in many diseases, the probability of a false positive (the classification of a patient as 'severe' when they are 'not severe') was equal to one. If these disease-severity measures were used as inclusion criteria for trials, our findings suggested a high probability of including patients outside of the target population. Additionally, in many diseases, specificity was equal to zero, meaning that all indicators observed in clinical studies were irrelevant to disease severity. The detection of a treatment effect in these cases could result in the licensing of a medicine with little clinical significance to patients.
If no treatment effect was detected, then trials may be abandoned, and effective medicines may be rejected at the regulatory stage, meaning that potentially life-changing medications may fail to reach patients. This is a recurrent problem in rare disease clinical trials and may be attributed to a lack of neutrality in endpoint selection. 24,25 Further, for these diseases, outcomes of relevance to disease severity may be underrepresented in the body of research, so patients may not benefit from ongoing evidence generation regarding the problems they deal with in their day-to-day lives. We observed a similar pattern of data at all prevalence rates that became more pronounced as prevalence increased. This was in line with previous findings and gave confidence in our results. 4

Limitations
First, we assumed that the DSS was a surrogate for the Neutral list, as it was the most accurate representation of the disease phenotype available. However, the Neutral list is an empirically unattainable theoretical concept. 5 Using a surrogate is likely to have resulted in an overestimation of the neutrality of clinical studies relative to what a 'true' measure of neutrality would have shown. Second, Neutral theory assumes that indicators are independent of each other; however, associations may exist between indicators to varying degrees. Finally, we did not control for the effect of time of publication of clinical studies, which may be reasonably expected to affect the number of indicators they shared with the surrogate Neutral list to some degree (clinical studies published before composite DSSs may be less likely to contain their indicators, although this is not guaranteed, as DSSs are generated based on existing bodies of knowledge shared by those who conduct trials). The variation in the year of publication of DSSs between diseases was not suggestive of a confound in respect of the effects noted in this study, and most DSSs were published between 5 and 10 years before the analysis.

Conclusions
Our results suggested that the potential for accuracy in measuring disease severity increases as a function of the body of knowledge on a disease. The neutrality of almost half the rare diseases in this study increased as the body of knowledge increased, while the neutrality of almost half decreased, suggesting that sustained research efforts in some diseases resulted in the development of more accurate measures of disease severity implemented in DSSs and clinical studies. The application of Neutral theory could enhance the accuracy of endpoint selection in clinical trials, verify the accuracy and relevance of treatment effects and ensure that the risk of misclassification during trial recruitment and the assessment of treatment effects is kept as low as possible. Further research may be beneficial to develop more accurate disease-severity measurements in Charcot Marie Tooth disease, Gaucher disease Type I, Huntington's disease, Sjogren's syndrome and Tourette syndrome.

Abbreviations
DSS: Disease-severity scale

Declarations

Ethical Approval
This study was exempt from ethical approval as the study author collected and synthesized data from previous publications, in which informed consent had already been obtained by the primary investigators, per Preferred Reporting Items for Systematic Reviews and Meta-Analysis guidelines. No human participants were recruited for this study.

Consent for Publication
Not Applicable.

Availability of data and materials
All data generated or analysed during the current study are included in this published article and are available from the corresponding author on reasonable request.

Competing interests
The author is a visiting senior lecturer at the Centre for Pharmaceutical Medicine Research at King's College London and is responsible for research into real-world evidence approaches. He is also the founder and CEO of Medialis Ltd, a medical affairs consultancy and contract research organisation involved in the design and delivery of real-world evidence in the pharmaceutical industry.

Funding
The author received no funding for this work.

Author contributions
RJ conducted the study and developed and approved the manuscript. The author affirms that the manuscript is an honest, accurate, and transparent account of the study being reported; that no important aspects of the study have been omitted, and any discrepancies from the study as planned (and, if relevant, registered) have been explained.

Figure caption: Potential misclassifications for all diseases in respect of the composite DSS