Estimation of sensitivity and specificity and calculation of sample size for a validation study with stratified sampling


Background: We propose and evaluate approximation formulae for the 95% confidence intervals (CIs) of the sensitivity and specificity and a formula to estimate sample size in a validation study with stratified sampling, where positive samples satisfying the outcome definition and negative samples that do not are selected with different extraction fractions.
Methods: We used the delta method to derive the approximation formulae for estimating the sensitivity and specificity and their CIs. From those formulae, we derived the formula to estimate the size of negative samples required to achieve the intended precision and the formula to estimate the precision for a negative sample size arbitrarily selected by the investigator. We conducted simulation studies in a population where 4% were outcome definition positive, the positive predictive value (PPV) was 0.8, and the negative predictive value (NPV) was 0.96, 0.98 or 0.99. The size of negative samples, n0, was either selected to make the 95% CI fall within ±0.1, ±0.15 or ±0.2 or set arbitrarily as 150, 300 or 600. We assumed a binomial distribution for the positive and negative samples. The coverage of the 95% CIs of the sensitivity and specificity was calculated as the proportion of CIs including the sensitivity and specificity in the population, respectively. For selected studies, the coverage was also estimated by the bootstrap method. The sample size was evaluated by examining whether the observed precision was within the pre-specified value.
Results: For the sensitivity, the coverage of the approximated 95% CIs was larger than 0.95 in most studies, whereas the coverage of the CIs derived by the bootstrap method was larger than 0.95 in 9 of the 18 selected studies. For the specificity, the coverage of the approximated 95% CIs was approximately 0.93 in most studies, but the coverage of the CIs derived by the bootstrap method was more than 0.95 in all 18 selected studies. The calculated size of negative samples yielded precisions within the pre-specified values in most of the studies.
Conclusion: The approximation formulae for the 95% CIs of the sensitivity and specificity for stratified validation studies are presented. These formulae will help in conducting and analysing validation studies with stratified sampling.


Background
In studies constructed to evaluate the validity of an outcome definition implementing chart reviews as the gold standard, the chart review is the most time-consuming part of the study. This is one of the main reasons why many validation studies estimate the positive predictive value (PPV) only when using chart reviews as the gold standard [1]. For example, in a recent review of 14 validation studies on dementia diagnoses, McGuinness stated, "Most reported only the positive predictive value (PPV)" [2]. In a paper published in the FDA-funded Mini-Sentinel pilot programme, the authors stated, "It was determined that 100 charts would be sufficient to obtain a reasonable PPV and establish the validation process" [3]. There are, however, several validation studies where many charts are reviewed to estimate not only the positive predictive value (PPV) but also other measures of validity (negative predictive value (NPV), sensitivity and specificity). For example, Widdifield et al. used as many as 9,500 random samples to identify 107 patients with rheumatoid arthritis (RA): 7,500 random samples of patients aged ≥20 years to identify 69 RA patients and an additional 2,000 aged ≥65 years to identify 38 RA patients [4]. Instead of employing the random sampling strategy, one may use the stratified sampling strategy, where positive samples obtained among patients who satisfy the outcome definition and negative samples obtained from patients who do not satisfy the definition are selected with different extraction fractions. In validation studies using the stratified sampling strategy, however, there seems to be no generally accepted approach to estimate measures of validity. For example, in a validation study published in 2014 by Husain et al. on nonalcoholic fatty liver disease, the sensitivity and specificity were estimated in a "population" of 600 patients, consisting of 450 who satisfied the outcome definition and 150 who did not [5]. 
One reason why the measures of validity were estimated in this artificial "population" but not in the original population may be that there was no generally accepted approach to estimate the sensitivity and specificity and their confidence intervals (CIs) in the original population when the stratified sampling strategy was employed.
One valid approach to obtaining these estimates is the use of the bootstrap method to estimate the CIs of the sensitivity and specificity, as in a validation study of preeclampsia in Norway published in 2014 by Klungsøyr et al. [6]. In this study, all (i.e., 100%) of the 3,500 women in a pregnancy cohort registered with preeclampsia (positive samples) and 1,840 (2.4%) random samples from 75,311 women without registered preeclampsia (negative samples) were examined utilizing antenatal charts and hospital discharge codes used as the gold standard. The CIs of the sensitivity and specificity in the original population of 78,811 women were estimated by the bootstrap method rather than the CIs in the artificial "population" of 5,340 women [6].
In the current study, we present approximation formulae for the CIs of the sensitivity and specificity in the original population in a validation study with stratified sampling. The formulae may be useful when the standard statistical package for the bootstrap method is not readily available. We also propose a relevant formula to estimate the size of negative samples required to achieve the intended precision or to estimate the size of the precision when an arbitrary negative sample size is used, provided that the PPV has already been estimated from positive samples.

Approximated 95% confidence intervals (CIs) of the sensitivity and specificity from a validation study with stratified sampling
The approximated 95% CIs of sensitivity and specificity were derived using the formula for the logarithm of the risk ratio [7] and the delta method (see Additional file 1 for the derivation) as follows.
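The 95% CI of the sensitivity, 95% $CI_{SE}$, is approximated as:

[Equation (1)]

Similarly, the 95% CI of the specificity, 95% $CI_{SP}$, is approximated as:

[Equation (2)]

In Equations (1) and (2), "a" is the number of true positives (TPs) and "b" is the number of false positives (FPs) among positive samples of $n_1$ subjects ($n_1 = a + b$), while "c" is the number of false negatives (FNs) and "d" is the number of true negatives (TNs) among negative samples of $n_0$ subjects ($n_0 = c + d$). The expected point estimate of the sensitivity ($\widehat{SE}$) in Equation (1) is approximated as:

$$\widehat{SE} = \frac{N_1 \widehat{PPV}}{N_1 \widehat{PPV} + N_0 (1 - \widehat{NPV})} \tag{3}$$

Similarly, the expected point estimate of the specificity ($\widehat{SP}$) in Equation (2) is approximated as:

$$\widehat{SP} = \frac{N_0 \widehat{NPV}}{N_0 \widehat{NPV} + N_1 (1 - \widehat{PPV})} \tag{4}$$

In Equations (3) and (4), $\widehat{PPV}$ is the estimate of the positive predictive value in positive samples, given as $\widehat{PPV} = a/n_1$, $\widehat{NPV}$ is the negative predictive value in negative samples, given as $\widehat{NPV} = d/n_0 = (1 - c/n_0)$, $N_1$ is the number of subjects who satisfy the outcome definition, and $N_0$ is the number of subjects who do not in the original population.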
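The point estimates of the sensitivity and specificity defined above follow directly from the four chart-review counts and the two stratum sizes. The sketch below is a minimal illustration; the function name `estimate_se_sp` and the counts in the usage example are hypothetical and were chosen only to mirror a population in which 4% of subjects satisfy the outcome definition.

```python
def estimate_se_sp(a, b, c, d, N1, N0):
    """Point estimates of population sensitivity and specificity.

    a, b: true and false positives among the n1 = a + b reviewed positive samples
    c, d: false and true negatives among the n0 = c + d reviewed negative samples
    N1, N0: definition-positive and definition-negative subjects in the population
    """
    n1, n0 = a + b, c + d
    ppv_hat = a / n1              # estimated PPV in the positive samples
    npv_hat = d / n0              # estimated NPV in the negative samples
    tp = N1 * ppv_hat             # projected true positives in the population
    fp = N1 * (1 - ppv_hat)       # projected false positives
    fn = N0 * (1 - npv_hat)       # projected false negatives
    tn = N0 * npv_hat             # projected true negatives
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical counts: 100 positive and 300 negative charts reviewed in a
# population of 100,000 subjects, 4% of whom satisfy the outcome definition.
se_hat, sp_hat = estimate_se_sp(a=80, b=20, c=6, d=294, N1=4_000, N0=96_000)
print(f"sensitivity = {se_hat:.3f}, specificity = {sp_hat:.4f}")
```

With these counts the estimated PPV is 0.80 and the estimated NPV is 0.98, so the sensitivity estimate is 3,200/(3,200 + 1,920) = 0.625.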
Alternative formulae for 95% $CI_{SE}$ and 95% $CI_{SP}$ worth exploring are given as:

$$95\%\ CI_{SE} = \widehat{SE} \pm \delta \tag{5}$$

where $\delta$ is given as [...], and

$$95\%\ CI_{SP} = \widehat{SP} \pm \delta \tag{6}$$

where $\delta$ is given as [...].

When comparing the calculation of 95% $CI_{SE}$ with Equations (1) and (5), the upper limit is the same, but the lower limit in Equation (5) is lower than that in Equation (1). Similarly, the upper limit of 95% $CI_{SP}$ in Equation (2) is the same as that in Equation (6), but the lower limit in Equation (6) is lower than that in Equation (2).

[...] positive samples, as "Stage I" and the next stage, in which the NPV, sensitivity and specificity are estimated, as "Stage II". In the proposed method, the chart reviews in Stage I should be [...]
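To make the "estimate plus or minus a half-width" form of interval concrete, the sketch below builds a symmetric interval from the log risk ratio variance and the delta method. This is an assumed, generic construction and is not claimed to reproduce Equations (1), (2), (5) or (6), whose exact forms are derived in Additional file 1; the function name `symmetric_delta_ci` is hypothetical.

```python
import math


def symmetric_delta_ci(a, b, c, d, N1, N0, z=1.96):
    """Return (estimate, half_width, lower, upper) for sensitivity and specificity."""
    if min(a, b, c, d) <= 0:
        raise ValueError("every cell count must be positive for this approximation")
    n1, n0 = a + b, c + d
    k = N0 / N1

    # Sensitivity = 1 / (1 + k * R) with R = (c/n0) / (a/n1), a ratio of two
    # independent binomial proportions; var(log R) has the usual form below.
    se_hat = 1 / (1 + k * (c / n0) / (a / n1))
    var_log_r = 1 / a - 1 / n1 + 1 / c - 1 / n0
    # d(SE)/d(log R) = -SE * (1 - SE), so the delta method gives this half-width.
    delta_se = z * se_hat * (1 - se_hat) * math.sqrt(var_log_r)

    # Specificity = 1 / (1 + R2 / k) with R2 = (b/n1) / (d/n0).
    sp_hat = 1 / (1 + ((b / n1) / (d / n0)) / k)
    var_log_r2 = 1 / b - 1 / n1 + 1 / d - 1 / n0
    delta_sp = z * sp_hat * (1 - sp_hat) * math.sqrt(var_log_r2)

    def interval(est, half):
        # Limits are truncated to the admissible range [0, 1].
        return est, half, max(0.0, est - half), min(1.0, est + half)

    return interval(se_hat, delta_se), interval(sp_hat, delta_sp)


# Hypothetical counts, as in a 100 + 300 chart review of a 4%-positive population.
se_res, sp_res = symmetric_delta_ci(a=80, b=20, c=6, d=294, N1=4_000, N0=96_000)
print("sensitivity (estimate, half-width, lower, upper):", se_res)
print("specificity (estimate, half-width, lower, upper):", sp_res)
```

The half-width for the specificity is much smaller than that for the sensitivity in this setting because the specificity is close to 1, where the factor estimate × (1 − estimate) is small.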

Estimation of the negative sample size and precision in a validation study with stratified sampling
In Equation (7), $N_0/N_1$ is obtained from the information on the population, the values of "a" and $\widehat{PPV}$ are obtained in Stage I, $\delta^*$ is the precision of $\widehat{SE}$ that must be obtained (i.e., the 95% $CI_{SE}$ obtained at the end of the study will be $\widehat{SE} \pm \delta^*$ or narrower) and $SE^*$ is the sensitivity used to calculate the size of negative samples ($n_0$). If good information on the sensitivity (e.g., the sensitivity estimated in a past study conducted in a similar population) is available, a likely value of the sensitivity may be used as $SE^*$ in Equation (7). However, if no good information on the sensitivity is available, one may use the value of $SE^*$ that produces the largest possible value of $n_0$ (defined as $n_{0max}$). The value of $n_{0max}$ can be determined numerically. Alternatively, $SE^*$ in Equation (7) may be fixed as 0.7 because $n_0$ in Equation (7) varies with $SE^*$ but is maximal (or $n_{0max}$) when $SE^*$ is approximately 0.7 (see Additional [...]).

We simulated two-stage validation studies with stratified sampling. In Stage I of all studies, 100 ($n_1$) random positive samples were selected from $N_1$ definition-positive subjects and evaluated by chart reviews. After the completion of chart reviews, $\widehat{PPV}$ was estimated as $a/n_1$. Then, the size of negative samples ($n_0$) in Stage II was determined by one of the following 4 options. In Option A, $n_{0max}$, or the largest value of $n_0$ that can achieve the pre-specified precision ($\delta^*$), is calculated by Equation (7); in Option B, $n_0(0.7)$ (where $SE^*$ is fixed as 0.7) is obtained by Equation (8); in Option C, $n_0$ is calculated by Equation (7), where the information on the population sensitivity (SE) available from the previous study is used as $SE^*$; and in Option D, $n_0$ is arbitrarily selected by the investigator. For Options A-C, the size of negative samples ($n_0$) was selected to attain $\delta^*$ = 0.2, 0.15 and 0.1 in Equation (7) or (8). In Option C, it was assumed that based on the information from the previous study, [...] We used the RAND function of SAS 9.4 with 10,000 iterations for each of the 36 studies.
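The two calculations described in this section (the negative sample size needed for a target precision, and the precision predicted for a chosen negative sample size) can be sketched under a generic delta-method variance. The code below is an assumption-laden illustration and not the paper's Equations (7) to (9), which are not reproduced here. The function names are hypothetical, the expected number of false negatives is taken as $n_0$ times the value of 1 − NPV implied by the assumed sensitivity, and the largest sample size is located by a simple grid search.

```python
import math


def implied_fn_fraction(se_star, ppv_hat, n0_over_n1):
    """1 - NPV implied by an assumed sensitivity, the Stage I PPV and N0/N1."""
    return ppv_hat * (1 - se_star) / (se_star * n0_over_n1)


def precision_for_n0(n0, se_star, a, n1, n0_over_n1, z=1.96):
    """Predicted half-width of the sensitivity interval for a chosen n0."""
    q = implied_fn_fraction(se_star, a / n1, n0_over_n1)
    # With c replaced by its expectation n0 * q: 1/c - 1/n0 = (1 - q) / (n0 * q).
    variance = 1 / a - 1 / n1 + (1 - q) / (n0 * q)
    return z * se_star * (1 - se_star) * math.sqrt(variance)


def n0_required(delta_star, se_star, a, n1, n0_over_n1, z=1.96):
    """Smallest n0 whose predicted half-width does not exceed delta_star."""
    q = implied_fn_fraction(se_star, a / n1, n0_over_n1)
    if not 0 < q < 1:
        raise ValueError("assumed sensitivity is incompatible with PPV and N0/N1")
    # Variance budget left for the negative samples after Stage I.
    room = (delta_star / (z * se_star * (1 - se_star))) ** 2 - (1 / a - 1 / n1)
    if room <= 0:
        return math.inf  # Stage I alone already exceeds the target half-width
    return math.ceil((1 - q) / (q * room))


def n0_max(delta_star, a, n1, n0_over_n1, z=1.96):
    """Grid search for the assumed sensitivity that gives the largest n0."""
    best_se, best_n0 = None, -1
    for i in range(5, 96):
        se_star = i / 100
        if not 0 < implied_fn_fraction(se_star, a / n1, n0_over_n1) < 1:
            continue
        n0 = n0_required(delta_star, se_star, a, n1, n0_over_n1, z)
        if n0 > best_n0:
            best_se, best_n0 = se_star, n0
    return best_se, best_n0


# Hypothetical Stage I result: a = 80 true positives among n1 = 100 charts,
# in a population with N0/N1 = 96/4 = 24.
print(n0_required(0.1, 0.7, a=80, n1=100, n0_over_n1=24))
print(n0_max(0.1, a=80, n1=100, n0_over_n1=24))
for n0 in (150, 300, 600):
    print(n0, round(precision_for_n0(n0, 0.7, a=80, n1=100, n0_over_n1=24), 3))
```

Under this sketch and these hypothetical inputs, the assumed sensitivity that maximises the negative sample size falls between 0.7 and 0.8, and the sample size obtained with the sensitivity fixed at 0.7 is only slightly smaller than the maximum.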
We also examined the coverage of 95% $CI_{SE}$ and 95% $CI_{SP}$ by using the bootstrap method for Studies 1, 4, 7, 10, 13, 16, 19, 22, and 25 (Option A) and Studies 28, 29, ..., and 33 (Option D). In the bootstrap method, we selected 1,000 resamples with SAS 9.4 SURVEYSELECT for each of 10,000 iterations. For the bootstrap method, the coverage was estimated as the proportion of the 10,000 iterations in which SE was included in the 2.5th-97.5th percentile range of $\widehat{SE}$ estimated by Equation (3) among the 1,000 resamples and as the proportion of the 10,000 iterations in which SP was included in the 2.5th-97.5th percentile range of $\widehat{SP}$ estimated by Equation (4) among the 1,000 resamples.
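The simulation and bootstrap procedures described above were run in SAS; the following Python sketch only illustrates their logic for the sensitivity. It is not the authors' code. The interval used is an assumed symmetric delta-method interval (not necessarily Equation (1) or (5)), the bootstrap is an assumed stratified resampling with replacement, all function names are hypothetical, and the iteration counts are reduced to keep the run short.

```python
import math
import random


def draw_binomial(rng, n, p):
    """Binomial draw as a sum of n Bernoulli trials (standard library only)."""
    return sum(rng.random() < p for _ in range(n))


def se_estimate(a, n1, c, n0, N1, N0):
    """Stratified point estimate of the sensitivity."""
    tp, fn = N1 * a / n1, N0 * c / n0
    return tp / (tp + fn)


def delta_ci(a, n1, c, n0, N1, N0, z=1.96):
    """Assumed symmetric delta-method interval; None when a or c is zero."""
    if a == 0 or c == 0:
        return None
    est = se_estimate(a, n1, c, n0, N1, N0)
    half = z * est * (1 - est) * math.sqrt(1 / a - 1 / n1 + 1 / c - 1 / n0)
    return max(0.0, est - half), min(1.0, est + half)


def simulate_coverage(ppv, npv, n1, n0, N1, N0, iterations, seed=0):
    """Share of simulated studies whose interval contains the population SE."""
    rng = random.Random(seed)
    true_se = N1 * ppv / (N1 * ppv + N0 * (1 - npv))
    hits = 0
    for _ in range(iterations):
        a = draw_binomial(rng, n1, ppv)        # true positives in Stage I
        c = draw_binomial(rng, n0, 1 - npv)    # false negatives in Stage II
        ci = delta_ci(a, n1, c, n0, N1, N0)
        # Undefined intervals are counted as misses (a conservative choice).
        if ci is not None and ci[0] <= true_se <= ci[1]:
            hits += 1
    return hits / iterations


def bootstrap_percentile_ci(a, n1, c, n0, N1, N0, resamples, seed=0):
    """2.5th-97.5th percentile range of the estimate over stratified resamples."""
    rng = random.Random(seed)
    stats = []
    for _ in range(resamples):
        # Resampling a stratum with replacement is a binomial draw
        # at the observed proportion of that stratum.
        a_star = draw_binomial(rng, n1, a / n1)
        c_star = draw_binomial(rng, n0, c / n0)
        if a_star + c_star > 0:
            stats.append(se_estimate(a_star, n1, c_star, n0, N1, N0))
    stats.sort()
    last = len(stats) - 1
    # Nearest-rank percentiles; adequate for an illustration.
    return stats[round(0.025 * last)], stats[round(0.975 * last)]


# Setting taken from the simulation design: 4% definition-positive,
# PPV = 0.8, NPV = 0.96, n1 = 100, n0 = 600 (population size is hypothetical).
coverage = simulate_coverage(0.8, 0.96, n1=100, n0=600, N1=4_000, N0=96_000,
                             iterations=1_000, seed=1)
print(f"coverage of the delta-method interval: {coverage:.3f}")
print(bootstrap_percentile_ci(80, 100, 24, 600, 4_000, 96_000, resamples=500, seed=1))
```

A full replication of the design would nest `bootstrap_percentile_ci` inside the iteration loop and count the iterations whose percentile range contains the population sensitivity.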
As the coverage of 95% $CI_{SE}$ or 95% $CI_{SP}$ in Equations (1), (2), (5) and (6) was often found to be less than 0.95 (see Results), we conducted post hoc studies to simulate a total of 10,560 fictitious studies with different combinations of $\delta^*$, PPV, NPV, $N_0/N_1$ and $n_1$, calculated $n_0$ in Equation (7) with Option A, and examined the coverage of 95% $CI_{SE}$ in Equations (1) and (5) and the coverage of 95% $CI_{SP}$ in Equations (2) and (6). We also compared 95% $CI_{SE}$ calculated from Equations (1) and (5) as well as 95% $CI_{SP}$ calculated from Equations (2) and (6) with the bootstrap percentiles shown in the study by Klungsøyr et al. [6].
The current study investigates formulae and designs in validation studies where only simulation data are used. Therefore, no ethics review was needed for the study.

Results
Table 1 shows the medians and 2.5th-97.5th percentile ranges of $\widehat{SE}$ and $\widehat{SP}$ obtained over 10,000 iterations. The medians of $\widehat{SE}$ and $\widehat{SP}$ are close to SE and SP, respectively. Table 1 also indicates that a larger negative sample size ($n_0$) is required to attain higher precision (smaller $\delta^*$), and additionally, the largest possible $n_0$ from Option A is close to $n_0(0.7)$ from Option B. Table 2 shows that as the size of negative samples ($n_0^*$) used in Option D increases, the predicted precision ($\delta_{max}$ or $\delta(0.7)$) decreases, i.e., the predicted CI narrows. It also shows that $\delta_{max}$ is close to $\delta(0.7)$.

Table 3 shows the medians of $\widehat{SE}$, the lower and upper limits of 95% $CI_{SE}$ and $\delta$ (the value calculated in Equation (5) for 95% $CI_{SE}$ or $1 - \widehat{SE}$ when the upper limit is set to 1 in Equations (1) and (5) [...]

Table 4 shows the medians of $\widehat{SP}$, the lower and upper limits of 95% $CI_{SP}$ and $\delta$ (the value calculated in Equation (6) for 95% $CI_{SP}$ calculated in Equations (2) and (6) or the difference between the 97.5th percentile and median $\widehat{SP}$ among 1,000 resamples calculated in the bootstrap method) for 18 studies. The coverage of 95% $CI_{SP}$ estimated by Equations (2) and (6) [...] in the bootstrap method is larger than that of $\delta$ from Equation (6).

Table 5 shows the coverage of 95% $CI_{SE}$ and 95% $CI_{SP}$ from 10,560 post hoc studies. The coverage of 95% $CI_{SE}$ is larger than 0.95 in 38.5% of the 10,560 studies when Equation (1) is used but is larger than 0.95 in 84.2% when Equation (5) is used. On the other hand, the coverage of 95% $CI_{SP}$ is larger than 0.95 in only 2.5% and 3.4% of the studies when Equation (2) and Equation (6) are used, respectively. However, the coverage of 95% $CI_{SP}$ from Equation (2) or (6) is larger than 0.93 in more than 80% of the 10,560 studies.

Discussion
When stratified sampling is employed in a validation study, the measures of validity in the original population rather than those in the artificial "population" should be estimated and reported to provide useful information. This is because the validated definition is normally used in studies conducted with the original population. For example, in the study by Klungsøyr et al. [6], the sensitivity was 96.8% and the specificity was 75.6% for the artificial "population" of 5,340 women, which differed from the sensitivity (43.0%) and specificity (99.3%) estimated for the original population.
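The gap between the artificial "population" and the original population arises because the two strata are sampled with different extraction fractions. The sketch below illustrates this by weighting each reviewed chart by the inverse of its sampling fraction. The counts are hypothetical (they are not the data of Klungsøyr et al.), and the function names are illustrative only.

```python
def artificial_population(a, b, c, d):
    """Sensitivity and specificity computed directly on the pooled samples."""
    return a / (a + c), d / (b + d)


def original_population(a, b, c, d, N1, N0):
    """Same measures after weighting each chart by the inverse sampling fraction."""
    w1 = N1 / (a + b)   # each positive sample represents w1 population subjects
    w0 = N0 / (c + d)   # each negative sample represents w0 population subjects
    sensitivity = (w1 * a) / (w1 * a + w0 * c)
    specificity = (w0 * d) / (w0 * d + w1 * b)
    return sensitivity, specificity


# Hypothetical study: 100 of 4,000 definition-positive subjects (2.5%) and
# 300 of 96,000 definition-negative subjects (about 0.3%) are reviewed.
counts = dict(a=80, b=20, c=6, d=294)
print("artificial:", artificial_population(**counts))
print("original:  ", original_population(**counts, N1=4_000, N0=96_000))
```

In this hypothetical example the pooled sensitivity is 80/86 (about 0.93), whereas the weighted estimate for the original population is 0.625; the pooled specificity is 294/314 (about 0.94), whereas the weighted estimate is about 0.99. The direction of both gaps matches the preeclampsia example above.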
Our study indicated that the 95% CI of the sensitivity and specificity can be approximated by Equations (1) and (2), respectively. However, we recommend the use of Equation (5) rather than Equation (1) to estimate the 95% CI for the sensitivity because the coverage of 95% $CI_{SE}$ from Equation (1) is larger than 0.95 in less than half of the studies shown in Table 3 and Table 5, while the coverage from Equation (5) is larger than 0.95 in 15 of 18 studies shown in Table 3 and in more than 90% shown in Table 5. The coverage of 95% $CI_{SP}$ was less than 0.95 in most studies when Equation (2) or (6) was used. Table 4 implies that the bootstrap method may give a better estimate for the CI of the specificity. However, Equation (2) or (6) may still be of use because the coverage was at least 0.93 in more than 80% of the studies shown in Table 5.
We also presented a formula to calculate the size of negative samples required to attain an intended precision once the chart reviews in Stage I are completed before selecting negative samples in Stage II. In addition, we presented a formula to calculate the precision that would be attained when the investigator uses a predetermined negative sample size.
The approximation formulae for 95% $CI_{SE}$ and 95% $CI_{SP}$ and the formulae to calculate the size of negative samples and the attainable precision may encourage the conduct of validation studies with stratified sampling, which can provide some information on sensitivity, NPV and specificity even if the primary purpose of the study is to estimate the PPV of the outcome definition.

Conclusion
We proposed formulae to approximate the 95% CIs of the sensitivity and specificity for validation studies with stratified sampling. We also proposed a formula to estimate the size of negative samples required to attain a pre-specified precision. The formulae may help in the proper conduct and analysis of validation studies with stratified sampling.

[...] positive samples; se 0: value of $SE^*$ that maximizes $n_0$ in Equation (7).
f Median of the difference between the 97.5th and 50th percentiles of $\widehat{SE}$ among 1,000 resamples.
g Proportion of 10,000 iterations in which the 2.5th-97.5th percentile range of 1,000 resamples includes SE.
$\delta^*$: intended precision of the sensitivity; $\delta_{max}$: maximum possible precision from Equation (9); $\widehat{SE}$: sensitivity in the samples; $\delta$: the value of $\delta$ from Equation (5) or $1 - \widehat{SE}$ when the upper limit of $\widehat{SE}$ is set to 1 and the difference between the 97.5th and 50th percentiles of $\widehat{SE}$ by the bootstrap method.

a Median of $\widehat{SP}$ from Equation (4) and medians of the upper and lower limits of 95% $CI_{SP}$ from Equation (2) (in parentheses).
b Median of $\delta$ from Equation (6).
c Proportion of 10,000 iterations in which 95% $CI_{SP}$ includes SP.
d Median of $\widehat{SP}$ from Equation (4) and medians of the upper and lower limits of 95% $CI_{SP}$ from Equation (6) [...]
f Median of the difference between the 97.5th and 50th percentiles of $\widehat{SP}$ in 1,000 resamples.
g Proportion of 10,000 iterations in which the 2.5th-97.5th percentile range of 1,000 resamples includes SP.
$\widehat{SP}$: specificity in the samples; $\delta$: the value of $\delta$ from Equation (6) and difference between the 97.5th and 50th percentiles of $\widehat{SP}$ by the bootstrap method.

Consent for publication
Not applicable

Availability of data and materials
All data generated or analysed during this study are included in this published article.

Funding
There is no funding source to declare.

Authors' contributions
KK conceptualized this study and was responsible for the methodology and analysis and for writing the initial draft. MI and TY were involved in the review and editing. All authors have read and reviewed the final manuscript.