Estimating the rate of overdiagnosis with prostate cancer screening: evidence from the Finnish component of the European Randomized Study of Screening for Prostate Cancer

Screening for prostate cancer may have limited impact on decreasing prostate cancer-related mortality. A major disadvantage is overdiagnosis, whereby lesions are identified that would not have become evident during the man’s lifetime if screening had not taken place. The present study aims to estimate the rate of overdiagnosis using Finnish data from the European randomized trial of prostate cancer screening. We used data from 80,149 men randomized to a screening or a control group, distinguishing four birth cohorts. We used the “catch-up method” to identify when the difference in the cumulative incidence of prostate cancer between the screening and control groups had stabilized, implying that the screening has no further effect. We define the overdiagnosis rate to be the relative excess cumulative incidence in the screened group at that point. As an independent method, we also examined the diagnosis rates of T1c tumors as an indicator of early tumors detected by PSA. The estimates of overdiagnosis rates from the catch-up method using the full period of available follow-up ranged between cohorts from 2.3% to 15.4%, and the T1c analysis gave very similar results. Some overdiagnosis has occurred, but there is uncertainty about its extent. A long follow-up is required to demonstrate the full impact of screening. We evaluated the overdiagnosis rates at a population level, associated with being offered screening, taking account of contamination (screening among the controls). The overall evaluation of screening should incorporate mortality benefit, cost-effectiveness, and quality of life.


Introduction
Prostate cancer is the second commonest cancer in males worldwide, but different regions have varying incidence and mortality. The risk of prostate cancer is higher in black men but is low in Asian men [1]. In the USA, the most commonly diagnosed cancer in men is in the prostate. The American Cancer Society [2] estimated that during 2018, about 164,690 new cases of prostate cancer would be diagnosed in the USA. In Canada, it has been estimated that about 21,300 men would be diagnosed with prostate cancer annually.
In 1986, PSA testing was approved by the Food and Drug Administration (FDA) to monitor the progression of prostate cancer. In 1994, the FDA approved the use of PSA in screening for prostate cancer in asymptomatic men. As a result, the incidence rates for prostate cancer increased substantially in the 1980s and 1990s, primarily because of widespread adoption of the PSA test. However, a recent analysis showed that the incidence of distant-stage prostate cancer increased among men aged 50-69 between 2004 and 2012 [4]. Moreover, an American Cancer Society (ACS) guideline updated in 2001 indicated there was still uncertainty about the overall value of periodic testing in terms of reducing the risk of death from prostate cancer. A randomized trial conducted in the USA found no mortality benefit [5], whereas a contemporaneous trial conducted in eight countries in Europe showed a 20% reduction in prostate cancer mortality [6]. At that point, the ACS recommended against PSA testing for asymptomatic men who had less than a 10-year life expectancy, and physicians were required to provide detailed information to their patients about the risks and potential harms of early detection. Also, a large cluster-randomized trial in the UK showed no mortality benefit at 10 years, but it was based on a single screening round with low compliance (36%) [3].
The European Association of Urology recommends the provision of PSA testing to informed men with elevated risk of prostate cancer, with follow-up intervals for men depending on their initial PSA levels. In 2017, the US Preventive Services Task Force recommended that men aged 55-69 should be informed about the benefits and harms of PSA testing, in order to decrease the number of men with aggressive disease being missed. All these findings imply that systematic population-based PSA testing is not strongly recommended [7].
While the benefits of PSA testing remain controversial, there has also been concern about the adverse effects of PSA testing, particularly with respect to the question of overdiagnosis. Specifically, PSA testing may detect some cancers which would not have been identified during a man's lifetime had screening not taken place [8]; the diagnosis of such lesions through screening clearly provides no mortality benefit. Such overdiagnosis could result from the presence of slow growing or indolent tumors, which can exist asymptomatically for many years. In these cases, screening potentially leads to harmful effects of treatment, such as erectile dysfunction, urinary incontinence, and others. However, at the time of screening, it is impossible to recognize which particular cases of cancer have been over-diagnosed and should have been left untreated. Even for aggressive cancers, it is possible that men will die before the cancer has time to progress; in such cases, this would also amount to over-diagnosis. Accordingly, in order to evaluate the overall benefits or harms of prostate screening, we need to quantify the extent of overdiagnosis in a screening program.
Two main approaches have been suggested to estimate the overdiagnosis rate: modeling of disease transition rates, and the "catch-up" or excess-incidence method [9]. The first approach models the hypothetical or counterfactual patterns in prostate cancer that would arise with or without screening, and then compares the results to estimate the rate of overdiagnosis. Examples of this method include: MISCAN [8,10], which is a microsimulation model that simulates individual life history as a Markov process of states and transitions to calculate the over-detection rate by deriving the lead time; the UMich (University of Michigan) method [11], in which a statistical model captures the features of registered prostate cancer cases before and after PSA screening was used, then predicts lead time and subsequently the overdiagnosis rate; and the FHCRC (Fred Hutchinson Cancer Research Center) method [12,13], in which a microsimulation model links an individual's PSA levels with the progression of his prostate cancer.
In all of these simulation methods, investigators have to find a balance between complexity and transparency in choosing an appropriate model. The complexity dimension can range from simple (involving only a few features of the disease) to complex (referring to many disease features and adopting many transitional probabilities). If a complex model is used, it may be difficult to evaluate the risk of bias in the results, due to a lack of transparency. On the other hand, a simple model may not capture all the important features of the disease process, and its interface with screening.
The second approach, the so-called "catch-up" method, uses observed excess incidence rates, and the cumulative difference in disease incidence between the screening and the control groups. In a review of alternative approaches to assessing overdiagnosis, the catch-up method has been described as the preferred approach [14], particularly in situations where randomized trial data are available, such as the Finnish data employed in the present analysis. Taking advantage of the data from a randomized control group allows one to more reliably estimate the expected disease pattern in the counterfactual scenario where screening has not been used.
In this study, the catch-up method was used in the Finnish component of the European Randomized study of Screening for Prostate Cancer (ERSPC) [15]. In this trial, men were individually randomized to be offered PSA testing (the screened group), or to a control group where screening was not offered. By virtue of the randomization, we can assume that the two groups have the same underlying risk of prostate cancer. We estimated the extent of cancer overdiagnosis by examining the pattern over time in the cumulative difference in the incidence rate of prostate cancer diagnosis between the screening and the control groups, during the follow-up period after the end of the screening intervention. We used regression methods to assess the point during the follow-up period when the difference in the cumulative incidence for all prostate cancer diagnoses had stabilized, indicating that the impact of the screening intervention had worn off. We also verified the results using a separate analysis of stage T1c tumors (defined as early, clinically inapparent, non-palpable cancers). The estimates of overdiagnosis rates reflect the comparison between the intervention and control groups as a whole; in other words, they evaluate the effect of being offered participation in a screening program or not. As such, any PSA testing that occurs in either group outside the trial itself is taken into account.

Methods
Data were abstracted from the Finland section of the ERSPC, which is a multi-center, randomized screening trial between an intervention arm offered PSA screening and a control arm without an intervention. In Finland, one of eight participating countries, 80,458 men aged 55-67 years were randomized to a screening or a control arm, distinguishing four birth cohorts: 1941-44, 1937-40, 1933-36, and 1929-32. Men in the three youngest cohorts in the screening group were offered up to three rounds of prostate screening at four-year intervals, in 1996-99, 2000-2003, and 2004-2007; the final round excluded men aged > 71 years; men in the oldest cohort were offered only two rounds of testing, starting in the same year. A PSA level of 4.0 ng/ml or higher was used as the indication for biopsy. For men with PSA between 3.0 ng/ml and 3.99 ng/ml, a digital rectal examination was initially offered as a supplementary (reflex) test in 1996-1998, and since 1999, the free/total PSA ratio was used (with a cut-off of 0.16). In this paper, we used data from the follow-up of trial participants over 18.6 years after randomization.
Figure 1 shows a schematic representation of the expected patterns of cumulative incidence of prostate cancer in the screened and control arms of a randomized trial, in either the absence or presence of overdiagnosis. Before screening begins, both arms accumulate cases at the same expected rate. However, during periods of screening, cases are found in the screening arm earlier than would otherwise have occurred; the degree to which the date of diagnosis is advanced is known as the lead time. The earlier distribution of diagnosis dates in the screened group manifests as a difference in cumulative incidence in that group, relative to controls. The difference may be further enhanced during later rounds of screening.
When the screening program ends, cases are then diagnosed more frequently in the controls than in the screened arm, because the pool of cases in the screened arm has been somewhat exhausted, and the controls experience their diagnoses later than in the study arm. In the absence of over-diagnosis, as in Fig. 1a, one expects all the control counterparts of the screened cases (with early diagnosis dates) to eventually be diagnosed at a later time. After some time, the screening effect will have dissipated, and the cumulative incidence in the controls will "catch up" with that in the screening arm.
In contrast, if there has been over-diagnosis of some cases in the screened arm, their expected control counterparts are never diagnosed, and consequently the cumulative incidence in the control arm always lags behind that of the screening arm. Conceptually, at some point, the cumulative difference in incidence between the screening and control arms will stabilize, and at that stage the cumulative difference will represent the number of over-diagnosed cases in the screened arm (Fig. 1b). We define the estimated overdiagnosis rate as the cumulative difference in incidence at this "stability point", divided by the cumulative incidence in the screened group.
The challenge is to determine when (or if) catch-up has occurred. We modeled the differences in the year-specific incidence rates with spline regressions. Using year-specific incidence, rather than the cumulative incidence difference, has the advantage that the incidence data points are mutually independent. We attempted to determine the stability point by identifying when the slope of the year-specific rate differences was at or close to zero. Our initial impression was that some of the trends in the Finnish data were not clear-cut, and that it might therefore be empirically difficult to define when stability had occurred. Accordingly, we also evaluated the performance of the spline regression method with simulated data. In the simulations, the year-specific incidence rates of each cohort were assumed to follow a Poisson distribution, which could be approximated by a Normal distribution. We assumed that the ideal pattern of incidence rate differences for the spline regressions would demonstrate patterns approximately as shown in Fig. 2, with the time axis starting at the end of the screening program. In the model of Fig. 2, there are up to three linear segments (or splines), with two join points (or 'breakpoints'). The sharp initial decrease occurs because of the early depletion of the pool of cases in the screened group. Then, for some period of time, the screening arm accrues cases at a lower rate than the controls. Finally, as the screening effect wears off, the control incidence rate converges to and eventually equals the rate in the screened group, and catch-up is then declared to have occurred. The rate difference at the catch-up point is zero, and hence we would conclude that there had been no over-diagnosis. However, if the rate difference stabilizes at a non-zero value, that value will provide the estimated extent of over-diagnosis.
The idealized three-segment models in Fig. 2a can be fitted if there are a sufficient number of data points for each spline segment, and if the follow-up period after the end of screening is long enough to actually observe stability in the rate difference, once the effect of screening has dissipated. If the data were insufficient to fit this model, a compromise two-segment model was adopted, as in Fig. 2b, in which there is not enough data to distinguish the second and third segments of the model in Fig. 2a. If the follow-up appears to have ended before stability of the rate difference can be identified, then a simpler model with only two splines and one join point was adopted, eliminating the final segment in the model of Fig. 2b.
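As an illustration of the two-segment model with one join point, the following Python sketch fits a continuous piecewise-linear regression by ordinary least squares over a grid of candidate breakpoints. It is not the fitting procedure used in this study, which was iterative and started from visually chosen initial breakpoints; the function names, the grid search, and the synthetic data are all assumptions made for the example.

```python
def solve_linear(a, b):
    """Solve a small dense system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= f * m[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = m[i][n] - sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = s / m[i][i]
    return x


def fit_two_segment(t, y, candidates):
    """For each candidate breakpoint c, fit y = b0 + b1*t + b2*max(t - c, 0)
    by least squares and keep the candidate with the smallest residual sum of squares.
    Candidates must lie strictly inside the observed range of t."""
    best = None
    for c in candidates:
        X = [[1.0, ti, max(ti - c, 0.0)] for ti in t]
        xtx = [[sum(r[i] * r[j] for r in X) for j in range(3)] for i in range(3)]
        xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(3)]
        beta = solve_linear(xtx, xty)
        rss = sum((yi - sum(bi * xi for bi, xi in zip(beta, r))) ** 2
                  for r, yi in zip(X, y))
        if best is None or rss < best["rss"]:
            best = {"breakpoint": c, "slope1": beta[1],
                    "slope2": beta[1] + beta[2], "rss": rss}
    return best


# Synthetic rate differences (hypothetical): sharp fall for 2 years, then slow recovery.
years = [float(k) for k in range(10)]
diffs = [-3.0 * t + 3.5 * max(t - 2.0, 0.0) for t in years]
grid = [1.0 + 0.5 * k for k in range(15)]  # candidate breakpoints 1.0, 1.5, ..., 8.0
fit = fit_two_segment(years, diffs, grid)
```

A three-segment version would add a second hinge term and search over pairs of breakpoints; with few data points per segment the design matrix can become singular, which is one way the non-convergence discussed in this paper could show up in such a sketch.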
In the simulations, we repeatedly fit the various spline regressions, to evaluate the performance of that method. We sampled the distributions of the year-specific incidence rates in the screening and control arms, based on the numbers of detected prostate cancer cases and the numbers of men at risk in each study year. The variance for each distribution was taken to be the same as the empirical mean rate, assuming Poisson distributions for the numbers of cases.
Because the rates in the screening and control groups are statistically independent, the variances of the rate differences can be taken as the sum of the two group-specific variances. Then, by appealing to the Central Limit Theorem, the year-specific rate differences were assumed to approximately follow a Normal distribution with this combined variance. For each simulated sample of data points, we attempted to fit the spline regression, and thus to estimate the catch-up point. Each simulation scenario was initially repeated 100 times, but if the number of converged regression fits was less than 50, we increased the number of simulation runs to 200, to acquire sufficient converged solutions with a specified number of spline segments. The final estimate of each parameter was taken as the sample mean calculated from the simulated set of fitted spline regressions if the distribution of the parameter was symmetric, but otherwise the median was used.
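The sampling step can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code used in the study: the Poisson assumption is applied to the case counts, so the variance of each rate is taken as cases divided by the squared number at risk (one possible reading of the description above), and the counts, the function name, and the seed are hypothetical.

```python
import random
import statistics


def simulate_rate_differences(cases_s, n_s, cases_c, n_c, n_sims=100, seed=0):
    """Draw simulated year-specific rate differences (screening minus control).

    Each difference is sampled from a Normal distribution whose mean is the
    observed rate difference and whose variance is the sum of the two
    arm-specific variances (the arms are treated as independent).
    """
    rng = random.Random(seed)
    sims = []
    for _ in range(n_sims):
        draw = []
        for cs, ns, cc, nc in zip(cases_s, n_s, cases_c, n_c):
            mean = cs / ns - cc / nc
            # Assumption: Var(count) = count under Poisson, so Var(rate) = count / n^2.
            var = cs / ns ** 2 + cc / nc ** 2
            draw.append(rng.gauss(mean, var ** 0.5))
        sims.append(draw)
    return sims


# Hypothetical counts and numbers at risk for three follow-up years.
cases_s, n_s = [120, 60, 70], [10000, 9800, 9600]
cases_c, n_c = [80, 90, 85], [15000, 14800, 14500]
sims = simulate_rate_differences(cases_s, n_s, cases_c, n_c, n_sims=2000)
mean_first_year = statistics.mean(s[0] for s in sims)
```

Each simulated series could then be passed to a spline-fitting routine, and the fitted breakpoints and slopes summarized by their mean or median across the converged runs.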
The alternative two- and three-segment spline models are not hierarchical, and we required a way to choose between them, given the available data. Accordingly, both models were fitted, with initial values of the breakpoints (which are required for the iterative fitting of the spline regressions), based on visual impressions of the plotted data. The Akaike Information Criterion (AIC) was then used as a suitable metric to select the preferred model in a consistent way for each case. The AIC provides a way to consider the trade-off between the goodness-of-fit of each model to the data and the model complexity. In particular, a sufficiently superior fit to the data is required in order to justify adopting the more complex three-segment model, in comparison to the simpler two-segment model. The model chosen by this criterion then gives estimates of the times of each breakpoint, and the slopes of each spline segment (to be denoted as slope1, slope2, and slope3, as appropriate).
Fig. 2 Schematic spline regression models with (a) three or (b) two segments for the year-specific incidence rate differences
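The trade-off expressed by the AIC can be illustrated with a short Python sketch. It uses one common least-squares form of the criterion, n ln(RSS/n) + 2k, which is valid up to an additive constant under Normal errors; the paper does not state which form was used, and the parameter counts (5 and 7, counting the intercept, slopes, breakpoints, and error variance) as well as the residual sums of squares are assumptions for the example.

```python
import math


def gaussian_aic(rss, n_obs, n_params):
    """AIC for a least-squares fit with Normal errors, up to an additive constant."""
    return n_obs * math.log(rss / n_obs) + 2 * n_params


def prefer_model(rss_two, rss_three, n_obs, k_two=5, k_three=7):
    """Return the label of the model with the smaller AIC, plus both AIC values."""
    aic_two = gaussian_aic(rss_two, n_obs, k_two)
    aic_three = gaussian_aic(rss_three, n_obs, k_three)
    label = "three-segment" if aic_three < aic_two else "two-segment"
    return label, aic_two, aic_three


# Hypothetical fits to 12 yearly data points.
modest_gain = prefer_model(rss_two=1.0, rss_three=0.9, n_obs=12)
large_gain = prefer_model(rss_two=1.0, rss_three=0.5, n_obs=12)
```

In this sketch, a 10% reduction in the residual sum of squares is not enough to offset the two extra parameters, so the two-segment model is retained, whereas halving the residual sum of squares favors the three-segment model.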
In the three-segment models (as in Fig. 2a), the first break point was conceptualized to be when the year-specific rate differences had reached their lowest point, and the second breakpoint is when the cumulative incidence rate difference has become stable, thus indicating that the impact of the screening intervention has dissipated; that time was taken as the catch-up point.
The AIC criterion will sometimes lead to a preference for the 2-segment model in situations where the transition between the second and third segment is not easily discerned. In some cases, even if the third segment slope is not significantly different from zero, that may not be sufficient to justify adopting the 3-segment model. One could, for instance, have a situation where slope2 and slope3 were very similar (and significantly different from zero or not, as the case may be), in which case the AIC would indicate a preference for the simpler two-segment model.
If the two-segment model was selected, then our best estimate of the catch-up point was based on the fitted incidence difference using that model, after the maximum period of follow-up. As will be seen from the fitted 2-segment models in both of the two older cohorts, the difference in incidence was close to zero at the end of the follow-up period, which implies that we have a reasonable estimate of overdiagnosis at that point.
A prerequisite for having a well-defined "catch-up" point is that there are enough years of follow-up, which ideally needs to be at least as long as the longest lead time that screening can provide [16]. For prostate cancer, the mean lead time has been estimated as between about 5 and 8 years in various analyses and populations [8,17,18]. Thus, the available follow-up time of over 18 years since randomization (and indeed for many men since the "last screen") likely exceeds the lead time for most cases. However, to the extent that catch-up has still not fully occurred, there might still be some tendency to overestimate the overdiagnosis rate.
We defined overdiagnosis to be the detection of cancers by screening that would not have become clinically evident in the absence of screening. In situations where the catch-up point could be identified, we estimated the overdiagnosis rate as $(I_s - I_c)/I_s$, where $I_s$ and $I_c$ are the cumulative incidence rates in the screened and control groups, respectively, at the catch-up point. A 95% confidence interval for the rate of overdiagnosis was calculated as $(I_s - I_c) \pm 1.96\sqrt{s_s^2 + s_c^2}$, where $s_s$ and $s_c$ are the standard errors of the corresponding cumulative incidence rates.
In addition to examining the cumulative incidence of all prostate cancer diagnoses, we also carried out a separate analysis of T1c tumors, which are typically asymptomatic. The empirical values of the difference in the cumulative incidence of these tumors were compared to the catch-up estimates of overdiagnosis at the latest points during the follow-up. The T1c analysis will reflect PSA testing both within the trial and outside it, as by definition a T1c cancer is a clinically inapparent tumor that is not palpable in digital rectal examination or visible in imaging (and, unlike T1a and T1b tumors, is not an incidental finding at transurethral resection of the prostate); it is frequently detected because of an elevated PSA, as it is too small to cause symptoms.
It is important to recognize that our estimates of overdiagnosis rates reflect comparisons between the intervention and control groups as a whole; in other words, they evaluate the effect of being offered participation in a screening program or not. As such, any PSA testing that occurs in either group outside the trial protocol itself is taken into account, including 'contamination' testing (screening- or symptom-driven) of men in the control group.
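The calculation of the absolute and relative overdiagnosis estimates can be sketched in Python as follows. This is a minimal illustration, assuming that the two arms are independent so that their standard errors combine in quadrature; the function name and the input values are hypothetical and are not taken from the trial data.

```python
import math


def overdiagnosis_estimate(i_s, i_c, se_s, se_c, z=1.96):
    """Absolute excess cumulative incidence, its confidence interval,
    and the excess relative to the cumulative incidence in the screened arm."""
    diff = i_s - i_c
    # Assumption: independent arms, so the variances of the two rates add.
    half_width = z * math.sqrt(se_s ** 2 + se_c ** 2)
    return {
        "absolute": diff,
        "ci": (diff - half_width, diff + half_width),
        "relative": diff / i_s,
    }


# Hypothetical cumulative incidences and standard errors at the catch-up point.
est = overdiagnosis_estimate(i_s=0.169, i_c=0.143, se_s=0.004, se_c=0.003)
```

With these illustrative inputs the absolute excess is 0.026 and the relative overdiagnosis rate is about 15% of the cumulative incidence in the screened arm.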

Prostate cancer incidence
Data used in this study were taken from the Finland data in the ERSPC, conducted in men born from 1929 to 1944. A total of 80,458 men were randomized to screening or control groups. Table 1 shows the sample sizes and summary statistics on the distribution of follow-up times available for the 1929-32, 1933-36, 1937-1940, and 1941-1944 cohorts; all men are followed indefinitely until death, and individuals were censored once a prostate cancer diagnosis had occurred.
The follow-up is summarized in Table 1 in two ways, since randomization, and since the last screen, each by study arm and birth cohort. Information on the follow-up since the last screening intervention took place is useful to examine if the trial had continued long enough so that men were being followed beyond their expected lead time. In the intervention arm of the trial, the "last screen" was defined as the date of the latest test for those who attended the screening program, or by the date of the most recent invitation for those who did not attend as a result of that invitation. In the control arm, PSA testing sometimes occurred outside the trial itself, but many of the control men were never tested. Therefore, and for comparability with the intervention arm, we artificially defined the date of the "last screen" for control men as the corresponding date of a randomly chosen man with the same birth year in the screening arm. Table 2 provides more detail of the number of men still being followed at the start of each year of follow-up, again since randomization and since the last screen. From Tables 1 and 2, it is evident that large numbers of men in both arms of the study were still being followed for many years after randomization, and even after the screening intervention had ended. As mentioned earlier, the long follow-up in this trial clearly exceeds the expected lead time, and will also exceed the individual lead times for the majority of prostate cancer cases (although, of course, individual lead times are not observable). These data also show that the ERSPC trial was the largest and had the longest period of follow-up for any randomized trial of prostate screening. Figure 3 shows the cumulative incidence, the year-specific incidence rate, and their differences, by birth cohort, and trial arm. 
Immediately evident is the fact that the cumulative incidence is progressively higher for the earlier birth cohorts, as would be expected [19]; accordingly, all our analyses were done separately for each cohort. The cumulative incidence plots do appear to support our initial conceptualization for their expected behavior, as displayed in Fig. 1.
The data for the 1929-32 cohort (Fig. 3a) appear to approach a zero cumulative difference between the screening and control groups, while the other cohorts retain nonzero differences. The two peaks in year-specific incidence correspond to the two screening rounds in the study protocol (during years 1 and 5 of follow-up) for this cohort. After the end of screening in follow-up year 5, the screened group incidence fell below the controls because of the lead time effect, and then the groups gradually converged at a catch-up point of about 16 to almost 19 years of follow-up since randomization.
In the three later birth cohorts, there are 3 years of excess incidence in the screening group corresponding to their screening protocol, followed by a deficit after year 9. The deficit continues for several years, then the screening group incidence gradually returns to that of the controls (Fig. 3b, c, and d). Figure 3 also shows the cumulative excess incidence rates. In each cohort, the cumulative excess achieves its maximum value at the time of the last screening round. None of the cohorts clearly attain a zero cumulative incidence difference by the end of follow-up, suggesting that some overdiagnosis may be present in each case, but that the effect of screening may persist beyond the last year of follow-up.
Table 3 shows the AIC statistic for the various spline regression models in each cohort; smaller values suggest the preferred model, among the cases where convergence of the model fitting was successful. On this basis, the appropriate numbers of breakpoints were defined as 1 for the 1929-32 and 1933-36 cohorts, and 2 for the 1937-40 and 1941-44 cohorts.

1929-32 cohort
Based on the AIC statistic, the preferred model for this cohort has one break point, at the point where the rate difference has its lowest value.
Among the 100 simulation runs, 98 converged for the spline model with 1 join point; summaries of the model parameters are shown in Table 4. The point when the year-specific rate difference reached its minimum was at 2.29 years. The estimated slope of the second segment was small, but zero was not contained within its whiskers, defined as [max(Q1 − 1.5 × (Q3 − Q1), min), min(Q3 + 1.5 × (Q3 − Q1), max)] [20] (this range is approximately μ ± 2.67σ under a normal distribution assumption), which suggested that it was significantly greater than 0. Figure 4a shows the two-segment model fitted to the observed data. It has a minimum around the second year of follow-up, which is close to the mean value in the simulated samples, and a subsequent rise to approximately 0. We conclude that 'catch-up' may have occurred, but there are insufficient data to define a later breakpoint after which the year-specific rate differences would have completely stabilized at zero.
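The whisker check applied to the simulated slopes can be sketched in Python as follows. The whisker limits follow the definition given in this section; the quartile interpolation rule (the "inclusive" method of the standard library) and the example values are assumptions, since the paper does not state how the quartiles were computed.

```python
import statistics


def tukey_whiskers(values):
    """Whiskers as [max(Q1 - 1.5*IQR, min), min(Q3 + 1.5*IQR, max)]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = max(q1 - 1.5 * iqr, min(values))
    upper = min(q3 + 1.5 * iqr, max(values))
    return lower, upper


def excludes_zero(values):
    """True when zero lies outside the whisker range of the simulated estimates."""
    lower, upper = tukey_whiskers(values)
    return lower > 0 or upper < 0


# Hypothetical simulated slope estimates.
all_positive = [1, 2, 3, 4, 5, 6, 7, 8, 9]
around_zero = [-2, -1, 0, 1, 2]
```

For the first hypothetical sample the whiskers run from 1 to 9 and exclude zero, whereas for the second sample they run from -2 to 2 and include it.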

1933-36 cohort
We adopted the two-segment spline model. All the simulation runs converged, and their estimated parameters are again summarized in Table 4. The minimum rate difference was approximately at 2.4 years after the last screen. Figure 4b shows two-segment model fitting to the observed data with a minimum incidence difference estimated at about 2.5 years of follow-up, but with a slow upward trend after that. The last few years of follow-up show variable incidence rate differences, both above and below zero, so again it is not completely clear if the catch-up point has been reached.

1937-40 cohort
We used a three-segment spline model with two break points. In order to acquire a larger sample of converged simulations, the number of replications was increased from 100 to 200; the results are summarized in Table 4, for the 80 simulations (40%) which converged. Non-convergence often occurred because there was only one data point in some time segments, or because two breakpoints were close to each other.

Fig. 3 Prostate cancer cumulative and year-specific incidence rates in screening and control arms (year-specific rate plotted against year of follow-up)
The distributions of the estimated slopes showed positive skewness for breakpoint 1, and negative skewness for breakpoint 2, but we adopted the mean values as the preferred summary, because qualitatively these values were close to their corresponding median. The minimum difference in year-specific incidence rates was reached just over 2 years after screening ends, then there is a slowly increasing trend until about 8 years. The mean slope of the third segment was positive over the short period of remaining follow-up data available.
The three-segment model fitted to the observed data is shown in Fig. 4c, indicating a rapid drop in the cumulative incidence rate difference for the first 2 years, and then a period of about 7 years with an approximately stable deficit in negative values; an increase is seen in the last year of available data, suggesting that a stable catch-up point may not yet have occurred.

1941-44 cohort
The pattern of year-specific rate differences for this cohort was similar to that of the 1937-40 cohort. In this case, about 70 (30%) of the three-segment model simulations converged, with non-convergence again occurring when there was only one data point in one or more segments or two closely-spaced breakpoints. Mean values were used to estimate breakpoints and slopes, because they were close to their corresponding medians in all cases.
The three-segment spline regression models fitted to the cohort data are displayed in Fig. 4d. The small difference between slope2 and slope3 illustrates the difficulty of identifying the time of the second break point, and this also explains why the standard deviation of join point 2 is much larger than for join point 1. Once again, we could not definitively identify if catch-up had occurred.
Table 5 shows estimates of the absolute and relative overdiagnosis rate, based on the cumulative incidence difference between the screened and control groups, for various periods of time since the last screen. The absolute cumulative incidence rate difference (i.e., the cumulative excess risk of prostate cancer) for men born in 1929-32 was 0.004 (95% confidence interval: −0.011, 0.019) at 14 years since the end of screening. Compared to the cumulative incidence in the screening group, the relative overdiagnosis rate was 2.3%. For men who started screening at age 63-66, 59-62, and 56-58, the cumulative incidence differences after 10 years of follow-up after the last screen were 0.026, 0.015, and 0.010. The corresponding relative rates of over-diagnosis were 15.4%, 11.4%, and 10.3%, respectively. This suggests proportionally greater absolute differences in incidence among older men, with correspondingly higher rates of over-diagnosis, in these three cohorts, each of which had three screens offered. However, the oldest cohort (born 1929-32), which was offered only two screens, does not reflect this trend.

Estimated over-diagnosis rates
A difficulty in interpreting these estimates of overdiagnosis is that PSA testing has occurred in the control group of the ERSPC, and also in the intervention group outside the regimen of the trial itself. Furthermore, a PSA test is used in the diagnostic process for almost all cases of prostate cancer. Finally, it is not possible to say, from the available data, whether some of these tests were true screens in asymptomatic men, and which tests might have been administered in response to symptoms, i.e., for clinical indications. As noted elsewhere, testing within the control group would probably tend to cause under-estimation of over-diagnosis. Despite this, it is not possible to devise a correction for this effect, because of the uncertainties surrounding the motivation for particular tests. In response to this concern, we carried out an additional analysis of diagnosis rates for prostate cancer T1c tumors, which are defined as clinically inapparent tumors that are not palpable nor detected in surgery for benign prostatic hyperplasia (transurethral resection of the prostate). This means that most early tumors detected by PSA testing would be classified as T1c.
We constructed life tables for T1c diagnoses in both arms of the trial, again with censoring when a prostate cancer diagnosis or death had occurred. From the cumulative incidence rates, we calculated relative overdiagnosis rates in the last year of follow-up data. The relative overdiagnosis rates based on T1c diagnoses for the 1929-32, 1933-36, 1937-40, and 1941-44 birth cohorts were 2.3%, 16.3%, 14.6%, and 12.7% respectively, agreeing very closely with the estimates from the catch-up method using all prostate diagnoses, which were 2.3%, 15.4%, 11.4%, and 10.3%. This supports the notion that the catch-up estimates are valid, in the sense of allowing for all tests in all men in the trial, and with the objective of estimating over-diagnosis in the trial groups as a whole.
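A simple life-table calculation of this kind can be sketched in Python as follows. It is only an illustration under stated assumptions: cumulative incidence is taken as one minus the product of the yearly probabilities of remaining undiagnosed, with men who die or receive another prostate cancer diagnosis removed from later risk sets; the exact life-table method used in the study is not described here, and all counts are hypothetical.

```python
def cumulative_incidence(events, at_risk):
    """Cumulative incidence by year from yearly event counts and numbers at risk."""
    undiagnosed = 1.0
    result = []
    for d, n in zip(events, at_risk):
        undiagnosed *= 1.0 - d / n  # probability of remaining undiagnosed this year
        result.append(1.0 - undiagnosed)
    return result


def relative_excess(ci_screened, ci_control):
    """Relative overdiagnosis rate in the last year of follow-up."""
    return (ci_screened[-1] - ci_control[-1]) / ci_screened[-1]


# Hypothetical yearly T1c diagnoses and numbers at risk in each arm.
ci_s = cumulative_incidence(events=[10, 10], at_risk=[100, 90])
ci_c = cumulative_incidence(events=[8, 8], at_risk=[100, 92])
excess = relative_excess(ci_s, ci_c)
```

In this toy example the cumulative incidence after two years is 0.20 in the screened arm and 0.16 in the control arm, giving a relative excess of 20%.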

Discussion
Based on our results, we found that the available years of follow-up (over 18 years since randomization, and 10 years after the end of the last scheduled screening round) in the three youngest cohorts in the trial were not quite enough for us to confirm definitively whether the incidence difference between the screening and control groups had stabilized; the oldest (1929-32) cohort, with 14 years of follow-up after the end of its last screening round, shows somewhat more convincing evidence that catch-up of the control group had occurred. Elsewhere, it has been estimated that 10-14 years of follow-up may be required [21]. It is possible that the cumulative incidence difference for prostate cancer will continue to decrease with further observation. If so, the best available estimate of the overdiagnosis rate would be calculated from the data in the last year of follow-up, but this would be an overestimate if the screen effect is still wearing off, even at that late stage. Table 6 summarizes the estimates of overdiagnosis obtained in other studies; these range widely, from 2.9% to 88.1%. Such substantial variation might be partly explained by the fact that in deriving these estimates, there are many possible choices for the denominator [9]. Studies by Etzioni et al. [22], Telesca et al. [17], and ourselves report overdiagnosis as a percentage of screening-detected cases. Others presented the number overdiagnosed as a proportion of the total number of cases detected, or the total number invited to screening. The variation may also be attributed to different methodologies being employed. In most modeling studies, investigators used disease incidence rates in a screening group to estimate the distribution of the lead time, or to infer the natural history of the disease, and subsequently estimate the corresponding frequency of overdiagnosis. Finally, these studies have involved a wide variety of participants, from young men with high PSA levels to old men with low PSA levels.

Table 6 Estimates of overdiagnosis obtained in other studies (study, period, data source: estimate)
Etzioni [22], 1988-1998, US SEER9: 29% in Whites, 44% in Blacks
Gulati [13], 1975-2005, US SEER9: 2.9-88.1%
Telesca [17], 1975-2000, US SEER9: 22.7% in Whites, 34.4% in Blacks
Wu [23], 1996-2005, ERSPC Finland: 3.4%
Pathirana [24], 1982-2012, Australian cancer database: 41%
Gulati [25], 10 US clinics: 8.8%-60.6%
Gulati [26]: 4%-78%
Excess incidence
Zappa [27], 1992-1995, Italy: 51%; 25% for 2% annual incremental incidence
Schröder [28], 1991-2006, ERSPC: 48 cases among 1410 screened men
Ciatto [29], 1991-1994, Italy: 66%
Fenton [30]: ERSPC 33.2%; PLCO 16.4%; CAP trial 40.7%
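As a small arithmetic illustration of how the choice of denominator alone can spread the reported percentages, the sketch below expresses one count of excess cases against the three denominators mentioned above. All counts and the function name are hypothetical and chosen only for illustration; they are not taken from any of the cited studies.

```python
def overdiagnosis_percentages(excess_cases, screen_detected, total_detected, invited):
    """Express a single excess-case count against three alternative denominators."""
    return {
        "per screen-detected case": 100.0 * excess_cases / screen_detected,
        "per case detected": 100.0 * excess_cases / total_detected,
        "per man invited": 100.0 * excess_cases / invited,
    }


# Hypothetical counts, for illustration only.
rates = overdiagnosis_percentages(
    excess_cases=50, screen_detected=200, total_detected=500, invited=10000
)
for label, pct in rates.items():
    print(f"{label}: {pct:.1f}%")
# per screen-detected case: 25.0%
# per case detected: 10.0%
# per man invited: 0.5%
```

The same 50 excess cases appear as 25%, 10%, or 0.5% depending on the denominator, so estimates from different studies are comparable only when the denominator is stated.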
There are two major challenges in estimating the excess cancer incidence in a screened group: first, how to estimate the incidence in unscreened persons; and second, the need for sufficient years of follow-up. Concerning the former, Zappa et al. [27] estimated the incidence without screening based on the pre-screening trend, while Schröder et al. [28] used data from a randomized clinical trial. Concerning the latter challenge, a long period of follow-up may make it difficult to prevent men in the control arm from being screened during the study years, and the same problem of additional screening also exists in the intervention arm of a trial. Therefore, when possible, the screening contamination rate of both groups should be considered. In the Finland data, approximately 10% of men in the screening group had a PSA test before their first screen in the trial [31] (although, being pre-randomization, these tests were equally distributed between the study arms). More recently, it has been estimated that 50% of the control men in Finland had a screening test at least once during the first eight years of follow-up [32]. Such a high contamination rate in the control arm will tend to reduce the excess incidence between the two groups, and thus lead to a reduction in the estimated overdiagnosis rate, even if the follow-up is long enough for the incidence difference to become stable. Nevertheless, our analysis has had the advantage of estimating overdiagnosis compared to a randomized control group that was not offered screening as part of the trial. Other approaches, such as comparing outcomes in screened individuals with the pre-screening trend, do not have the obvious benefits of randomization. In addition to differences in their analytic methods, further reasons for the variation between the results of studies summarized in Table 6 include differences in the screening protocols and techniques.
A key point here is that we are evaluating the impact of being offered screening, and not necessarily of actually being screened. The study intervention in the ERSPC is an offer to be screened, and not necessarily attendance at the screening program. This distinction is very similar to the concept of 'intention-to-treat' in randomized trials of therapy, in which there may be departures from the study protocol, such as non-compliance or crossovers in treatment. In the analogous situation of a screening trial, participants in either group may or may not adhere to their randomized assignment (being screened or not).
Our estimates of overdiagnosis rates therefore reflect comparisons between the intervention and control groups as a whole, i.e., in an effectiveness context. In other words, they evaluate the randomized comparison of a group offered participation in a screening program with a group not offered it, regardless of whether men were actually screened. The data from each group reflect its entire experience, which will include being screened or not, inside or outside the trial itself. Any PSA testing that occurs within either group but outside the trial protocol is part of that entire experience, and can indeed affect the estimated overdiagnosis rate. Overdiagnosis can actually occur in individual men within either group, but it is not identifiable at the individual level. However, our randomized comparisons reveal the overall difference between the overdiagnosis rates for the two trial groups in an unbiased way.
The fact that some PSA testing also took place in the controls is an important factor in the interpretation of our results. Because of the way the testing data was accessed for the community-based control men, we do not know if any given PSA test in that group was carried out as a true screen (asymptomatically), as opposed to symptom-driven testing. Furthermore, PSA testing is involved in the process of making almost all prostate cancer diagnoses, including in the intervention group, and again we cannot tell which particular tests should "count" as screens in either group. There will surely have been some 'contamination' of the control group by true screening tests, and although PSA testing among the controls was quite frequent [33], we cannot say how often this occurred as true screens. The same is true of the intervention group. Because of these uncertainties, it is not possible to 'correct' or adjust for non-screening PSA tests carried out on the trial participants. Such an adjustment, if it were possible, could lead to estimates of the prostate cancer diagnosis rates among individual men actually screened versus a counterfactual scenario where screening did not take place, in other words as an efficacy comparison, but not one whose validity is protected by the randomization.
We therefore carried out an additional analysis of T1c tumors, which are defined as early, clinically inapparent, and not detectable by digital rectal examination or transrectal ultrasound, which leaves PSA as the likely indication for a prostate biopsy. Analysis of these tumors should therefore provide an estimate of the overdiagnosis rate based on true screening tests. We found that the overdiagnosis estimates were very consistent with the main catch-up analysis, and thus they provide supporting evidence for the validity of our main analysis with respect to the impact of offering a screening program.
As mentioned earlier, the Finland component of the European trial of prostate screening has considerable strengths in terms of the randomized trial design, with particularly large sample sizes and very long follow-up. The follow-up period in this study is very long in absolute terms, and longer than in other trials we are aware of. It is longer than for the PLCO or the entire ERSPC study (16-17 years), and substantially longer than for the CAP/ProtecT trial (10 years) or the Quebec trial (11 years). So, this trial appears to offer the best available data on the overdiagnosis question.
Despite these strengths, there are some limitations to our analysis. In our analyses using the excess-incidence method, data were unavailable on the clinical characteristics of patients, such as the method of detection (screen-detected, opportunistic PSA, other incidental, symptomatic), prognostic features (stage, tumor aggressiveness), or subsequent outcome; this information would be required for assessing the factors characterizing overdiagnosed cases at an individual level. It should also be noted that all our estimates of overdiagnosis rates apply, in the first place, to the population studied in the Finnish component of the ERSPC. The importance of this problem elsewhere will potentially vary according to factors such as the ad hoc PSA testing behavior of asymptomatic men, the local distribution of risk factors for prostate cancer, and patterns of other morbidity.
In conclusion, we have examined the feasibility of using regression modeling to find the "catch-up" point when the effect of screening has worn off, and the cumulative incidence difference between screened and control men has become stable. Based on the Finland data, we concluded that the cumulative incidence difference at the last available year of follow-up may have led to some over-estimation of the overdiagnosis rate. Our best estimates of the relative overdiagnosis rate were 2.3%, 15.4%, 11.4%, and 10.3% for the 1929-32, 1933-36, 1937-40, and 1941-44 cohorts, respectively. Theory suggests that the overdiagnosis rate might increase with age, because of the combined effects of a higher detection rate and higher rates of other causes of death in older men [34]. However, the lower over-diagnosis rate for the oldest men in our results could be explained by the fact that there were only two screening rounds for the 1929-32 cohort. Also, recall that we estimate over-diagnosis rates based on the entire experience of each study arm, including PSA tests in either arm that may or may not be associated with the screening trial itself. Finally, note that we have estimated rates of relative overdiagnosis; further work on this topic might consider absolute rates of overdiagnosis, and contrast them against the NND (the number needed to detect), i.e., evaluate the number of over-detected cases versus one averted death.