Pre- and post-treatment [18F]FDG PET/CT images of 30 oncologic patients were selected from a group of tumor types with representative patterns of FDG avidity. The cases contained a mix of single and multiple tumors on the pretreatment scan (1 tumor, n=6; >1 but <10 tumors, n=19; ≥10 tumors, n=5) and a mix of the four major response categories using PERCIST (complete metabolic response, n=6; partial metabolic response, n=11; stable metabolic disease, n=4; and progressive metabolic disease, n=9).
Sites with and without National Cancer Institute Quantitative Imaging Network affiliation that had not participated in the previous study using the same data set were recruited by email and conference calls. The data set was based on a previous study of reader variability (9).
Thirty anonymized cases of pre- and post-treatment [18F]FDG PET/CT studies (total 60 studies) were distributed along with directions for installing and utilizing the Auto-PERCIST™ software. Approval from the Johns Hopkins Institutional Review Board was obtained, and the need for patient informed consent was waived for this study of anonymized image data.
Measurement
The measurements from the paired pre- and post-treatment [18F]FDG PET/CT images of one patient were counted as a read. The paired pre- and post-treatment measurements for all 30 cases from a single reader were counted as a set of reads. One reader from the central site (reader 1) had full knowledge of the primary tumors, treatment histories and subsequent follow-up results. All other readers had no knowledge of the patients’ medical histories, because readers are often intentionally blinded in multicenter trials. For statistical purposes, the measurements by reader 1 were considered the reference standard for comparison (readreference).
Each reader determined which tumor to measure. Auto-PERCIST™ loads the PET images and automatically obtains liver measurements from a 3 cm diameter sphere in the right side of the liver to compute the threshold for lesion detection. The default setting at baseline is 1.5 × liver mean + 2 standard deviations (SD), to ensure that a decline in [18F]FDG uptake is less likely to be due to chance and to minimize overestimation of response or progression. For follow-up images, the default setting is lower, at 1.0 × liver mean + 2 SD, to allow detection of lesions with lower SULpeak. If a lesion was visually perceptible but not detected using the default threshold settings, the reader could manually lower the detection threshold. Auto-PERCIST™ detects all sites with SULpeak higher than the threshold (Figure 1). It was up to the readers to determine whether the detected sites were true tumor lesions. The reader could split a detected focus of [18F]FDG uptake into smaller lesions when needed, either to exclude adjacent physiologic [18F]FDG uptake or to break down a large conglomeration of tumors into separate lesions. The reader could also merge smaller foci of [18F]FDG uptake into a single lesion if the reader decided that the separate foci were parts of one larger lesion. The readers were instructed to select up to 5 of the hottest tumors in cases with multiple lesions. The readers could view the PET/CT images on any reading software they preferred, but the measurements came only from Auto-PERCIST™. The measurements from Auto-PERCIST™ included SULpeak, maximum and mean SUL, number of counts, geometric mean, exposure, kurtosis, skewness, and metabolic volume. After the readers selected and quantified the lesions, the measurements were saved as text files and sent to the Image Response Assessment Core at Johns Hopkins University for central compilation and analysis.
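The default detection thresholds described above can be illustrated with a minimal sketch. This is not the Auto-PERCIST™ implementation; the function name `detection_threshold` and its parameters are hypothetical and introduced only for illustration, and the liver mean and SD are assumed to be SUL values measured in the 3 cm liver sphere.

```python
def detection_threshold(liver_mean: float, liver_sd: float, baseline: bool = True) -> float:
    """Return the default SULpeak detection threshold (illustrative sketch only).

    Baseline scans:  1.5 x liver mean + 2 SD
    Follow-up scans: 1.0 x liver mean + 2 SD
    """
    if liver_mean <= 0 or liver_sd < 0:
        raise ValueError("liver_mean must be positive and liver_sd non-negative")
    # The multiplier on the liver mean is the only difference between time points.
    factor = 1.5 if baseline else 1.0
    return factor * liver_mean + 2.0 * liver_sd


# Hypothetical liver values, chosen only to show the arithmetic.
baseline_cutoff = detection_threshold(2.0, 0.25, baseline=True)
followup_cutoff = detection_threshold(2.0, 0.25, baseline=False)
```

With these hypothetical liver values the follow-up cutoff is lower than the baseline cutoff, which reflects the stated aim of detecting lesions with lower SULpeak after treatment.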
Statistical analysis
The primary study metric was the percentage change in SULpeak (%ΔSULpeak) from baseline to follow-up. Percentage change was defined as [(follow-up measurement – baseline measurement) / (baseline measurement)] × 100. For assessment of up to 5 lesions, the percentage change was computed from the sum of the lesions. Treating both case and site as random effects, a linear random-effects model was fit by restricted maximum likelihood estimation, which yielded the variance components of the random effects. As a measure of inter-rater agreement, the intra-class correlation coefficient (ICC) was computed from these variance components as [inter-subject variance / (inter-subject variance + intra-subject variance + residual variance)]. The bias-corrected and accelerated bootstrap method with 1,000 bootstrap replicates was used to construct the 95% confidence interval of the computed ICC. The sampling unit was a read.
To assess agreement between the reference reader (readreference) and each other reader, the ICC was computed for each pairing of the reference reader with one of the 12 other readers. The mean of these ICCs and its range (minimum, maximum) were reported.
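The two formulas in this paragraph can be sketched as follows. This is an illustrative sketch under stated assumptions, not the analysis code used in the study: the function names are hypothetical, the variance components are assumed to have been estimated already by a mixed-model routine, and the hottest lesions are selected by sorting on SULpeak.

```python
from typing import Sequence


def percent_change_sulpeak(baseline: Sequence[float],
                           followup: Sequence[float],
                           max_lesions: int = 5) -> float:
    """Percentage change in summed SULpeak, using up to `max_lesions` hottest lesions."""
    # Sum the hottest lesions at each time point independently.
    base_sum = sum(sorted(baseline, reverse=True)[:max_lesions])
    follow_sum = sum(sorted(followup, reverse=True)[:max_lesions])
    if base_sum <= 0:
        raise ValueError("baseline sum must be positive")
    return (follow_sum - base_sum) / base_sum * 100.0


def icc_from_variance_components(inter_subject_var: float,
                                 intra_subject_var: float,
                                 residual_var: float) -> float:
    """ICC = inter-subject / (inter-subject + intra-subject + residual)."""
    total = inter_subject_var + intra_subject_var + residual_var
    if total <= 0:
        raise ValueError("total variance must be positive")
    return inter_subject_var / total


# Hypothetical numbers, chosen only to show the arithmetic.
change = percent_change_sulpeak([10.0, 5.0, 5.0], [6.0, 4.0])  # sums: 20 -> 10
icc = icc_from_variance_components(8.0, 1.0, 1.0)
```

In this hypothetical example the summed SULpeak halves, giving a change of -50%, and the inter-subject component accounts for 8 of 10 variance units, giving an ICC of 0.8.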
Krippendorff's alpha reliability coefficient was computed as a measure of agreement between multiple readers for response outcome, which was classified into four ordered major response categories using PERCIST 1.0: complete metabolic response (CMR), partial metabolic response (PMR), stable metabolic disease (SMD), and progressive metabolic disease (PMD). The measurements were classified as follows: PMD for SULpeak increase ≥ 30% (and 0.8 units) or new lesions; SMD for SULpeak increase or decrease < 30% (or 0.8 units); PMR for SULpeak decrease ≥ 30% (and 0.8 units); and CMR for no perceptible tumor lesion. Additionally, Krippendorff's alpha was computed with the response categories dichotomized in two ways: clinical benefit (CMR/PMR/SMD) versus no benefit (PMD), and response (CMR/PMR) versus no response (SMD/PMD). Krippendorff suggests 0.8 as a threshold for satisfactory reliability, but if tentative conclusions are acceptable, 0.667 is the lowest conceivable threshold (10).
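The classification rules and the two dichotomizations described above can be sketched as follows. This is an illustrative sketch, not the study's code: the function names and the boolean flags `new_lesions` and `no_perceptible_lesion` are hypothetical, and giving new lesions precedence over all other criteria is an assumption made here for the sketch.

```python
def classify_percist(baseline_sul: float,
                     followup_sul: float,
                     new_lesions: bool = False,
                     no_perceptible_lesion: bool = False) -> str:
    """Assign CMR, PMR, SMD or PMD from the change in SULpeak (illustrative sketch)."""
    if new_lesions:              # assumption: new lesions always mean progression
        return "PMD"
    if no_perceptible_lesion:    # no perceptible tumor lesion at follow-up
        return "CMR"
    delta = followup_sul - baseline_sul
    pct = delta / baseline_sul * 100.0
    # Both the relative (30%) and absolute (0.8 unit) criteria must be met.
    if pct >= 30.0 and delta >= 0.8:
        return "PMD"
    if pct <= -30.0 and delta <= -0.8:
        return "PMR"
    return "SMD"


def has_clinical_benefit(category: str) -> bool:
    """Clinical benefit (CMR/PMR/SMD) versus no benefit (PMD)."""
    return category in {"CMR", "PMR", "SMD"}


def is_response(category: str) -> bool:
    """Response (CMR/PMR) versus no response (SMD/PMD)."""
    return category in {"CMR", "PMR"}


# Hypothetical SULpeak values, chosen only to exercise each branch.
examples = [
    classify_percist(10.0, 6.0),                             # -40%, -4.0 units
    classify_percist(2.0, 1.3),                              # -35%, but only -0.7 units
    classify_percist(5.0, 7.0),                              # +40%, +2.0 units
    classify_percist(5.0, 0.0, no_perceptible_lesion=True),  # lesion resolved
]
```

The second example shows why the absolute criterion matters: a 35% decrease that amounts to less than 0.8 units is classified as SMD, not PMR.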