Repeatability of 18F-FDG uptake in metastatic bone lesions of breast cancer patients and implications for accrual to clinical trials

BACKGROUND Standard measures of response such as Response Evaluation Criteria in Solid Tumors are ineffective for bone lesions, often making breast cancer patients with bone-dominant metastases ineligible for clinical trials with potentially helpful therapies. In this study we prospectively evaluated the test-retest uptake variability of 2-deoxy-2-[18F]fluoro-D-glucose (18F-FDG) in a cohort of breast cancer patients with bone-dominant metastases to determine response criteria. The thresholds for 95% specificity of change versus no-change were then applied to a second cohort of breast cancer patients with bone-dominant metastases. Nine patients with 38 bone lesions were imaged with 18F-FDG in the same calibrated scanner twice within 14 days. Tumor uptake was quantified as the maximum tumor voxel normalized by dose and body weight (SUVmax) and the mean of a 1-cc maximal uptake volume normalized by dose and lean-body-mass (SULpeak). The asymmetric repeatability coefficients with confidence intervals of SUVmax and SULpeak were used to determine limits of 18F-FDG uptake variability. A second cohort of 28 breast cancer patients with bone-dominant metastases who had 146 metastatic bone lesions was imaged with 18F-FDG before and after standard-of-care therapy for response assessment. RESULTS The mean relative difference of SUVmax in 38 bone tumors of the first cohort was 4.3%. The upper and lower asymmetric limits of the repeatability coefficient were 19.4% and −16.3%, respectively. The 18F-FDG repeatability coefficient confidence intervals resulted in the following patient stratification for the second patient cohort: 11-progressive disease, 5-stable disease, 7-partial response, and 1-complete response with three inevaluable patients. The asymmetric repeatability coefficient response criteria changed the status of 3 patients compared to the standard Positron Emission Tomography Response Criteria in Solid Tumors of ±30% SULpeak.
CONCLUSIONS In evaluating bone tumor response for breast cancer patients with bone-dominant metastases using 18F-FDG uptake, the repeatability coefficients from test-retest studies show that reductions of more than 17% and increases of more than 20% are unlikely to be due to measurement variability. Serial 18F-FDG imaging in clinical trials investigating bone lesions from these patients, such as the ECOG-ACRIN EA1183 trial, benefits from confidence limits that allow interpretation of response.


Introduction
Breast cancer is the most common malignancy and second leading cause of cancer death in women [1], and bone is the most common site of metastasis in breast cancer [2][3][4][5]. The appearance and behavior of bone metastases can be detected on a wide variety of clinical imaging studies that are performed for different indications, e.g. x-ray computed tomography, bone scan, magnetic resonance imaging, and 2-deoxy-2-[18F]fluoro-D-glucose (18F-FDG) positron emission tomography with computed tomography attenuation mapping (PET/CT) [6].
Imaging-based response criteria are often used to determine the efficacy of new therapeutic agents in cancer treatment trials. The most commonly used set of criteria in clinical trials is the Response Evaluation Criteria in Solid Tumors version 1.1 (RECIST) [7], which focuses predominantly on the physical dimensions of solid tumors from CT scans, similar to other size-based criteria such as those from the World Health Organization (WHO) [8]. However, CT does not evaluate the bone or bone marrow, but only the osteoblastic reaction in healing bone [9]. For this reason, RECIST criteria specify that bone lesions without soft-tissue components are non-measurable, non-target lesions. As a result, patients with bone-dominant disease are often excluded from clinical trials due to a lack of RECIST-measurable disease [10][11][12].
There is active interest in using measures of 18F-FDG uptake with PET/CT imaging as a biomarker to assess early response to therapy for multiple types of cancer [13,14]. For breast cancer, the AVATAXHER trial [15] and, more recently, the 2019 results of the TBCRC026 trial, along with at least 11 other studies [6, 16-26], support using PET imaging as an effective method of measuring early breast cancer response in vivo.
An early effort to define PET-based response criteria for clinical trials was led by the European Organization for Research and Treatment of Cancer (EORTC) in 1999 [27]. The EORTC response criteria were expanded and modified by Wahl and colleagues in 2009 for the Positron Emission Tomography Response Criteria in Solid Tumors, or PERCIST [28]. Multiple clinical studies have shown that response assessment by EORTC criteria and PERCIST leads to similar response classifications. In addition, there are preliminary data that suggest that response assessment by PERCIST is better correlated with patient outcome and may be a better predictor for the effectiveness of new anti-cancer therapies than RECIST [29]. However, there have been only very limited reported evaluations of the use of PET imaging specifically for response assessment of osseous metastases from breast cancer [9,30], and an extension of PERCIST to metastatic bone disease is not yet established [21]. Peterson et al. [31] evaluated a modified version of PERCIST inclusion criteria (mPERCIST) accounting for the lower standardized uptake values (SUVs) of osseous lesions compared to the soft-tissue lesions previously studied using PERCIST. This study found that changes in 18F-FDG-PET uptake during therapy were predictive of time to skeletal-related events (tSRE) and time-to-progression (TTP).
To design effective response criteria, an understanding of the test-retest variability is needed. In this study we prospectively evaluated the test-retest variability of 18F-FDG-PET uptake in a cohort of breast cancer patients with metastatic bone-dominant lesions (BD-MBC) using the mPERCIST inclusion criteria. The calculated thresholds for 95% specificity of change versus no-change from the test-retest data were then applied to a second cohort of BD-MBC patients who had 18F-FDG-PET scans both pre-therapy and after the start of therapy. The classifications of change status were compared to those using EORTC, PERCIST, and recently published thresholds for soft-tissue cancers from the QIBA Profile [32].

Patient selection.
Cohort-1: Repeatability was assessed in a cohort of stage IV BD-MBC patients with stable bone disease who underwent two 18F-FDG PET/CT studies on the same scanner within two weeks, with no interval change in therapy. Patient and scan characteristics for cohort-1 appear in Table 1.

<Table 1>
* Type of cancer is invasive ductal carcinoma (IDC), mixed lobular and ductal carcinoma (Mixed), ductal carcinoma in situ (DCIS) or unknown (Unk); † ER/PR/Her2 pathology status is positive (+), negative (-) or equivocal (=); ‡ Treatment is endocrine (Endo), chemotherapeutic (Chemo) or combined with a biologic (Bio), such as a PARP inhibitor; § 18F-FDG scan differences: ∆Days is the number of days between scans, ∆UT is the difference in minutes of uptake times (time between injection and scanning) between scans, ∆%Dose is the percent difference in dose between scans and ∆[Glc] is the change in blood glucose concentration between scans.
Cohort-2: A second retrospective cohort of 28 BD-MBC patients with planned standard-of-care therapy (including endocrine therapy, chemotherapy, and biological therapies) was imaged with 18F-FDG before and within 30 days following therapy. Aspects of this study have been presented elsewhere [31].
Ethics and Consent: Patients in both cohorts were recruited from the Seattle Cancer Care Alliance or the University of Washington Medical Center (Seattle, WA), and signed informed consent prior to enrollment. All methods were performed in accordance with the ethical standards as laid down in the Declaration of Helsinki and its later amendments or comparable ethical standards, as approved by our local IRB (Institutional Review Board), Human Subjects and Radiation Safety committees.
PET/CT scanners and calibration. There were three PET scanners used in the study. Cohort-1 patients were all imaged on one of two General Electric (GE) Discovery STE PET/CT scanners [33] with identical reconstruction parameters, and each test-retest study was acquired on the same scanner. In addition to the recommended PET scanner calibration [32], the two scanners were cross-calibrated and quantitative performance was monitored with NIST-traceable reference sources to ensure similar quantitative accuracy [34,35].
Most cohort-2 patients (15) were imaged on the same PET/CT scanner in serial studies. However, due to the addition of the GE Discovery STE PET/CT scanners at our center, thirteen cohort-2 patients were initially imaged on a GE Advance PET scanner [36] and underwent the second scan on a Discovery scanner. We have shown that our calibration and cross-calibration procedures and identical acquisition and reconstruction protocols provide test-retest accuracy comparable to a well-calibrated single scanner [37].
18F-FDG-PET imaging protocol. The imaging protocol was performed according to clinical standards, consistent with the QIBA 18F-FDG-PET/CT Profile [32]. Patients fasted for a minimum of 6 hours before administration of 18F-FDG. Medications that affect bone marrow uptake of the tracer (G-CSF, Epogen, or Procrit) were withheld for 2-3 weeks prior to scanning. The 18F-FDG dose, obtained from Cardinal Health, ranged from 260-407 MBq (median 350 MBq). Images were acquired with a target of 60 min after injection of FDG (actual range 50-70 minutes) using multiple fields-of-view to image from the level of the eye orbits to mid-thigh.
Image analysis. Images from the Advance PET scanner were reconstructed using 2D filtered back projection reconstruction (4.29 x 4.29 x 4.25 mm voxel resolution), while images from the Discovery PET/CT scanners used iterative 3D reconstruction (4.29 x 4.29 x 3.27 mm voxel resolution). All reconstructions had corrections for dead time, random events, scatter, sensitivity, decay, branching ratio, and attenuation. PET images were read by two qualified and experienced nuclear medicine physicians.
The maximum and peak standardized uptake values (SUVmax and SUVpeak) for each lesion were extracted using the PMOD image analysis software (PMOD Technologies V4.1, Zurich, CH). SUVpeak volumes-of-interest (VOIs) were constructed as a cubic volume of approximately 1.5 cc centered on the maximum voxel (SUVmax) of each bone lesion. The average SUV of the VOI was the SUVpeak value. Both SUVmax and SUVpeak were normalized to lean-body-mass, producing SULmax and SULpeak.
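The normalization described above can be sketched in a few lines of Python. This is only an illustration: the function name, the example numbers, and the assumption that the injected dose has already been decay-corrected to scan time are not taken from the study, and lean body mass is passed in directly because the formula used to derive it is not stated here.

```python
def suv(activity_kbq_per_ml: float, dose_mbq: float, mass_kg: float) -> float:
    """Standardized uptake value in g/mL.

    activity_kbq_per_ml: tissue activity concentration (kBq/mL)
    dose_mbq: injected dose, assumed already decay-corrected to scan time (MBq)
    mass_kg: body weight for SUV, or lean body mass for SUL
    """
    if dose_mbq <= 0 or mass_kg <= 0:
        raise ValueError("dose and mass must be positive")
    dose_kbq = dose_mbq * 1000.0
    mass_g = mass_kg * 1000.0
    # concentration / (dose per gram): (kBq/mL) / (kBq/g) = g/mL
    return activity_kbq_per_ml / (dose_kbq / mass_g)


# Hypothetical lesion: hottest voxel 25 kBq/mL, 350 MBq dose,
# 70 kg body weight, 50 kg lean body mass.
suv_max = suv(25.0, 350.0, 70.0)  # normalized by body weight
sul_max = suv(25.0, 350.0, 50.0)  # normalized by lean body mass
```

SULpeak would use the same function with the mean concentration of the peak VOI in place of the single hottest voxel.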
Statistical methods. Repeatability of SUVs in metastatic bone lesions in cohort-1 patients was assessed using the procedures described by Velasquez et al. for gastrointestinal cancers [38] and Weber et al. for non-small cell lung cancer [39]. Both studies used 18F-FDG-PET multicenter test-retest exams, as in the current study. The calculated metrics are summarized in Supplementary materials Table S1. Variability was assessed by calculating the difference of paired SUV measurements, $d_i$, and the difference of the logs of the SUV measurements, $\Delta_i$:
$$d_i = \mathrm{SUV}_{i2} - \mathrm{SUV}_{i1}, \qquad \Delta_i = \ln(\mathrm{SUV}_{i2}) - \ln(\mathrm{SUV}_{i1})$$
The difference of the logs of the SUV measurements, $\Delta_i$, can be useful where $d_i$ does not follow a normal distribution or where the relative differences are found to be proportional to the mean [40]. The SUV measurements $\mathrm{SUV}_{i1}$ and $\mathrm{SUV}_{i2}$ are for lesion $i$ at the time of the baseline and the follow-up scans, and are calculated using SUVmax and SULpeak, which are the most common clinical 18F-FDG-PET biomarkers. The variability of the parameters $d_i$ and the log-transformed values, $\Delta_i$, was assessed using Bland-Altman plots. The consistency of $d_i$ and $\Delta_i$ with a normal distribution was assessed with quantile-quantile plots and Kolmogorov-Smirnov tests.
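As a rough sketch of the paired-difference step, the Python below computes the raw and log differences for each lesion and the mean and 95% limits that a Bland-Altman plot would display. The sign convention (follow-up minus baseline), the function names and the sample values are assumptions made for illustration, not the study's data or code.

```python
import math
from statistics import mean, stdev


def paired_differences(scan1, scan2):
    """Return (d, delta): raw and natural-log differences per lesion,
    taken as follow-up minus baseline (an assumed sign convention)."""
    if len(scan1) != len(scan2):
        raise ValueError("scan1 and scan2 must have the same length")
    d = [b - a for a, b in zip(scan1, scan2)]
    delta = [math.log(b) - math.log(a) for a, b in zip(scan1, scan2)]
    return d, delta


def bland_altman_limits(diffs):
    """Mean difference and the 95% limits of agreement (mean +/- 1.96 SD)."""
    m = mean(diffs)
    s = stdev(diffs)  # sample standard deviation (n - 1 divisor)
    return m - 1.96 * s, m, m + 1.96 * s


# Hypothetical SUVmax values for four lesions at test and retest.
scan1 = [4.0, 6.0, 8.0, 10.0]
scan2 = [4.4, 5.7, 8.4, 10.5]
d, delta = paired_differences(scan1, scan2)
lower, bias, upper = bland_altman_limits(delta)
```

Plotting `delta` against the per-lesion mean of the two scans would give the log-scale Bland-Altman view.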
The log-transformed data were used to calculate the mean percent difference in uptake between scans, the within-subject coefficient of variation ($\mathrm{wCV}_{\Delta}$), the repeatability coefficient (RC), and the asymmetric RC limits (−RC and +RC) as described in Supplementary materials, Table S1. The 95% confidence interval (CI) for the mean percent difference (an estimate of bias between scans) did not include 0. However, this was hypothesized to be a sampling effect and, to be conservative, the repeatability metrics were also calculated without subtracting the sample mean. This folds any bias into the estimate of variability and increases the associated metrics: the within-subject coefficient of variation with bias included ($\mathrm{wCV}_{\Delta 0}$), the repeatability coefficient with bias included ($\mathrm{RC}_0$) and the asymmetric repeatability coefficients with bias included ($-\mathrm{RC}_0$ and $+\mathrm{RC}_0$). Details of the calculations are provided in the Supplementary materials.
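The metrics named above can be illustrated with a short Python sketch. The exact formulas are given in the Supplementary materials and are not reproduced here, so the code uses the conventional log-scale definitions as an assumption: within-subject SD equal to SD of the differences divided by √2, RC equal to 1.96 times the SD of the differences, and asymmetric percent limits equal to exp(±RC) − 1. The choice of an n divisor for the bias-included variant is also an assumption.

```python
import math


def repeatability_metrics(delta, subtract_mean=True):
    """Repeatability metrics from log differences ln(SUV_i2) - ln(SUV_i1).

    With subtract_mean=False the true mean is taken to be zero, so any
    bias between scans is counted as variability (bias-included variant).
    """
    n = len(delta)
    if n < 2:
        raise ValueError("need at least two paired measurements")
    mean_delta = sum(delta) / n
    if subtract_mean:
        variance = sum((x - mean_delta) ** 2 for x in delta) / (n - 1)
    else:
        variance = sum(x ** 2 for x in delta) / n  # mean assumed to be zero
    sd = math.sqrt(variance)
    within_sd = sd / math.sqrt(2.0)  # a difference carries two measurement errors
    rc = 1.96 * sd                   # repeatability coefficient on the log scale
    return {
        "mean_pct": 100.0 * (math.exp(mean_delta) - 1.0),
        "wcv_pct": 100.0 * math.sqrt(math.exp(within_sd ** 2) - 1.0),
        "rc_log": rc,
        "rc_lower_pct": 100.0 * (math.exp(-rc) - 1.0),
        "rc_upper_pct": 100.0 * (math.exp(rc) - 1.0),
    }


delta = [0.05, 0.10, 0.02, 0.07]  # hypothetical log differences
bias_excluded = repeatability_metrics(delta, subtract_mean=True)
bias_included = repeatability_metrics(delta, subtract_mean=False)
```

Because the limits are exponentiated, they are asymmetric in percent terms: a log-scale RC of ln(1.2) corresponds to an upper limit of +20% and a lower limit of about −16.7%.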
Metrics were calculated using the lesion as the unit of analysis. To account for non-independence of multiple lesions from the same patient, 95% CIs for the repeatability coefficients were calculated using the leave-one-patient-out jackknife method [41]. This involves estimating the standard error of the repeatability metric by recalculating the metric after excluding one patient at a time (all lesions from that patient are excluded at each step), which assumes that the patients are independent but the lesions within a patient are not. The Supplementary materials describe the approach in more detail.
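A minimal sketch of the leave-one-patient-out jackknife follows. The data layout (a dictionary from patient to that patient's lesion values), the helper names, the sample values and the normal-approximation interval are illustrative assumptions; the study's own procedure is the one described in its Supplementary materials.

```python
import math


def jackknife_se(lesions_by_patient, metric):
    """Leave-one-patient-out jackknife standard error of `metric`.

    lesions_by_patient: dict mapping patient id -> list of lesion values
    metric: function that takes a flat list of values and returns a float
    """
    patients = list(lesions_by_patient)
    g = len(patients)
    if g < 2:
        raise ValueError("need at least two patients")
    leave_one_out = []
    for left_out in patients:
        # Drop every lesion belonging to the left-out patient.
        kept = [v for p in patients if p != left_out
                for v in lesions_by_patient[p]]
        leave_one_out.append(metric(kept))
    centre = sum(leave_one_out) / g
    variance = (g - 1) / g * sum((t - centre) ** 2 for t in leave_one_out)
    return math.sqrt(variance)


def rc_bias_included(delta):
    """Log-scale RC without subtracting the sample mean (assumed form)."""
    return 1.96 * math.sqrt(sum(x * x for x in delta) / len(delta))


# Hypothetical log differences grouped by patient.
data = {"P1": [0.05, -0.02], "P2": [0.10], "P3": [-0.04, 0.03, 0.08]}
estimate = rc_bias_included([v for vals in data.values() for v in vals])
se = jackknife_se(data, rc_bias_included)
ci = (estimate - 1.96 * se, estimate + 1.96 * se)  # normal approximation
```

Removing whole patients rather than single lesions is what lets the standard error reflect within-patient correlation.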
PERCIST Quality Control. We applied the PERCIST recommendations for quality control by measuring the mean SUL of a 3 cm spherical VOI in a normal region of the right lobe of the liver to check that the difference between the scans was less than 20% and also less than 0.3 g/mL for both cohort-1 and cohort-2 patients.
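The quality control rule can be written as a small check. In this sketch the relative difference is taken against the first scan, which is an assumption, and the function name and example values are hypothetical.

```python
def liver_qc_passes(sul_liver_scan1: float, sul_liver_scan2: float,
                    max_relative: float = 0.20,
                    max_absolute: float = 0.3) -> bool:
    """True when liver mean SUL differs between scans by less than 20%
    and by less than 0.3 g/mL."""
    if sul_liver_scan1 <= 0:
        raise ValueError("liver SUL must be positive")
    absolute = abs(sul_liver_scan2 - sul_liver_scan1)
    relative = absolute / sul_liver_scan1  # relative to scan 1 (assumed)
    return relative < max_relative and absolute < max_absolute


ok = liver_qc_passes(1.6, 1.4)       # 0.2 g/mL and 12.5% apart
too_far = liver_qc_passes(2.0, 2.35)  # 0.35 g/mL apart
```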
Inclusion criteria. The PERCIST criterion for including lesions in evaluations of response to therapy is $\mathrm{SULpeak} \geq 1.5 \times \mathrm{SUL}_{\mathrm{liver}} + 2 \times \mathrm{SD}_{\mathrm{liver}}$, where $\mathrm{SUL}_{\mathrm{liver}}$ is the mean SUL value of the normal liver region described above and $\mathrm{SD}_{\mathrm{liver}}$ is the sample standard deviation of the VOI. As we have previously noted [31,42], bone lesions appear to have lower average SULpeak values and a lower coefficient of variation than the soft-tissue lesions previously studied using PERCIST. In addition, it has been shown that the standard deviation of a VOI from a single image is not related to the true noise, i.e. the noise measured from multiple images of the same object [43]. For these reasons we proposed a modified PERCIST (mPERCIST) lesion inclusion criterion for bone lesions, defined by a lower threshold based on the liver SUL.
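For illustration, the published PERCIST 1.0 inclusion threshold (1.5 times the mean liver SUL plus 2 standard deviations) can be coded as below. The function names and example values are hypothetical. The two factors are left as parameters so that a modified criterion could be expressed, but no mPERCIST values are assumed here.

```python
def inclusion_threshold(liver_mean_sul: float, liver_sd: float,
                        mean_factor: float = 1.5,
                        sd_factor: float = 2.0) -> float:
    """Minimum SULpeak for a lesion to be assessable.

    Defaults follow PERCIST 1.0: 1.5 x liver mean SUL + 2 x liver SD.
    """
    return mean_factor * liver_mean_sul + sd_factor * liver_sd


def lesion_is_assessable(sulpeak: float, liver_mean_sul: float,
                         liver_sd: float, **factors) -> bool:
    """True when the lesion SULpeak meets the inclusion threshold."""
    return sulpeak >= inclusion_threshold(liver_mean_sul, liver_sd, **factors)


# Hypothetical patient: liver mean SUL 1.6 g/mL, liver SD 0.2 g/mL.
threshold = inclusion_threshold(1.6, 0.2)
included = lesion_is_assessable(3.1, 1.6, 0.2)
excluded = lesion_is_assessable(2.5, 1.6, 0.2)
```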
Cohort-2 patient data were used to assess the impact of PERCIST and mPERCIST thresholds for inclusion in studies, as well as the use of the cohort-1 bone lesion ±RC for the determination of response to therapy. The PERCIST approach uses the concept of a 'target' lesion to determine response, where only the percentage difference in SULpeak between the tumor with the highest value in study 1 and the tumor with the highest value in study 2 (i.e. not necessarily the same tumor) is used as the classifier for response. The criteria from EORTC and QIBA were also included where appropriate.
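The target-lesion rule described above compares the hottest lesion in each study, which need not be the same lesion. A minimal sketch, with a hypothetical function name and made-up SULpeak values:

```python
def target_lesion_percent_change(baseline_sulpeaks, followup_sulpeaks):
    """Percent change between the hottest lesion at baseline and the
    hottest lesion at follow-up (not necessarily the same lesion)."""
    if not baseline_sulpeaks or not followup_sulpeaks:
        raise ValueError("each study needs at least one lesion")
    hottest_baseline = max(baseline_sulpeaks)
    hottest_followup = max(followup_sulpeaks)
    return 100.0 * (hottest_followup - hottest_baseline) / hottest_baseline


# Hypothetical patient: lesion 1 is hottest at baseline, lesion 2 at follow-up.
baseline = [5.0, 3.2, 4.1]
followup = [2.9, 3.5, 3.0]
change = target_lesion_percent_change(baseline, followup)  # (3.5 - 5.0) / 5.0
```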

Cohort-1 characteristics. Nine female breast cancer patients with metastatic bone disease were enrolled in cohort-1, with an average age of 51 years (median 55, range 32-62). Patients had a mixture of sclerotic, lytic, or mixed-type lesions. Most of the patients were postmenopausal (7/9, 78%) with invasive ductal carcinoma (6/9, 67%). Most patients had ER positive disease (8/9, 89%), while some were HER2 negative (4/9, 44%). Seven patients were on therapy before enrolling in the study, and two had no therapy prior to the repeatability scans. For the patients who were on treatment, there were no changes to treatment between the two scans. The injected dose of FDG ranged from 305 to 396 MBq for both test and retest scans (mean 368 MBq ± 20 MBq). The median time between scans was 8 days (range 2-14). The average glucose level was 94 mg/dL (range 88-104) for the first scan and 92 mg/dL (range 89-96) for the second scan. The uptake time from tracer injection to the onset of imaging averaged 61 minutes (range 58-70 minutes for all scans), while the difference in uptake times between scan 1 and scan 2 per patient ranged from 0 to 6 minutes. Cohort-1 patient and scan characteristics appear in Table 1.
Repeatability of Bone Lesion FDG Uptake Values.
Individual SUVmax and SULpeak test-retest measurements for 38 lesions from 9 patients in cohort-1 are provided in Table S2 of the Supplementary materials. The median number of lesions per patient was 5 (range: 1 to 9 lesions). An example test-retest FDG image set from cohort-1 is shown in Fig. 1 and illustrates the consistency of SUV measures between the scans. Also shown is a cohort-2 example of response to therapy as assessed by SUV.

<FIGURE 1>
For quantitative analysis, Bland-Altman plots of individual lesion differences for SUVmax and SULpeak are shown in Fig. 2. The corresponding Bland-Altman plots for within-patient averages of lesions are shown in Supplementary materials Figure S1.

<FIGURE 2>
The tests of normality of the differences, using both quantile-quantile plots (Supplemental materials Figure S2) and Kolmogorov-Smirnov tests (p = 0.88), showed that all the results were consistent with a normal distribution.
The Bland-Altman plots above indicated a potential increase in variance of the SUV differences as a function of the average value. This dependence was not apparent in the difference of the natural logarithms of the SUV values.
The derived repeatability metrics for metastatic bone lesions in breast cancer patients using log-transformed SUVmax and SULpeak measurements, which are normally distributed, are provided in Table 2. The repeatability metrics for other extracted PET parameters, SUVpeak and SULmax, are presented in Supplementary materials Table S3. The 95% confidence interval of the average difference of the log-transformed data was found to not contain zero. This was hypothesized to be a sampling effect and RCs were recalculated without subtracting the sample mean, as described in the Supplementary materials. This approach led to slightly more conservative (i.e. larger) estimates of the repeatability coefficients and coefficient of variation, denoted in Supplementary materials Table S2 as $\pm\mathrm{RC}_0$ and $\mathrm{wCV}_{\Delta 0}$.

<Table 2>
PERCIST Quality Control: Cohort-1. For cohort-1, the average liver SULmean was 1.6 g/mL (range 1.2 to 2.0 g/mL) in the first scan and 1.6 g/mL (range 1.3 to 1.8 g/mL) in the second scan. The average difference between scans for 18F-FDG uptake in liver was −0.02 g/mL (range −0.21 to 0.15 g/mL). The differences in patient liver SUL values between scans were well under the threshold of 0.3 g/mL suggested by PERCIST guidelines.
Cohort-2 characteristics. Patient and scanning characteristics of the 28 patients in cohort-2 are presented in Supplemental materials Table S4. After baseline 18F-FDG imaging, patients received different therapies before post-therapy PET imaging and were followed clinically thereafter. There were 146 metastatic bone tumors identified by a combination of 18F-FDG-PET and CT imaging [31].
PERCIST Quality Control: Cohort-2. For cohort-2, 3 of the 28 patients did not meet the PERCIST quality control requirement (i.e. that the difference between scans of the mean SUL for a liver ROI is less than 20% and also less than 0.3 g/mL), and one patient had interpretable liver results from the second scan.
Assessment of Inclusion Criteria. The PERCIST threshold allowed assessment of 23 patients, and the mPERCIST threshold allowed assessment of 26 of the 28 patients in the cohort (Supplementary materials Table S5). We note that of the 3 additional patients included by the mPERCIST criteria, one did not meet the PERCIST quality control requirement for liver (Case 50 in Table S5). While the PERCIST approach uses only the change in the single target lesion(s) to determine response, we also evaluated the impact of the change in inclusion thresholds on all 146 metastatic lesions in the 28 patients and found that the PERCIST threshold allowed assessment of 76 of the bone tumors (52%). The mPERCIST threshold allowed assessment of 102 (70%) of the bone tumors, a substantial increase. These changes for the target lesion FDG SUVmax are illustrated in Fig. 3 along with thresholds for partial response (PR) and progressive disease (PD) based on QIBA, PERCIST, EORTC and the $\pm\mathrm{RC}_0$ thresholds developed from the cohort-1 test-retest study ($-\mathrm{RC}_0$ = −16.3% for PR and $+\mathrm{RC}_0$ = 19.4% for PD).
Assessment of Response thresholds. The response criteria developed from cohort-1 test-retest studies of 18F-FDG SULpeak values in bone lesions ($\pm\mathrm{RC}_0$) changed the response status of 4/28 patients compared to standard PERCIST response criteria. The changes were evenly divided between shifts from stable disease (SD) to progressive disease (PD) and shifts from SD to partial response (PR) when moving from the PERCIST thresholds of ±30% to the bone metastasis −RC, +RC thresholds of change (−17.5%, +21.2%) for 18F-FDG SULpeak values. In some cases new lesions appear, which is considered an overriding determination of progressive disease, regardless of the change in SUL, the PERCIST/mPERCIST threshold or the PERCIST inclusion criteria.
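The effect of narrowing the thresholds can be illustrated with a small classifier. This is a sketch only: the function, the strict inequalities at the boundaries and the example percent changes are assumptions, and complete response (which depends on more than a percent change) is not modelled. The threshold pairs are the ±30% PERCIST limits and the (−17.5%, +21.2%) SULpeak limits quoted above.

```python
PERCIST_LIMITS = (-30.0, 30.0)   # percent change in SULpeak
BONE_RC_LIMITS = (-17.5, 21.2)   # test-retest limits for bone lesions


def classify_response(percent_change, limits, new_lesions=False):
    """Return 'PD', 'PR' or 'SD' for a target-lesion percent change.

    New lesions override the percent change and give 'PD'.
    """
    lower, upper = limits
    if new_lesions:
        return "PD"
    if percent_change < lower:
        return "PR"
    if percent_change > upper:
        return "PD"
    return "SD"


# Hypothetical percent changes for three patients.
changes = [-25.0, 24.0, -5.0]
by_percist = [classify_response(c, PERCIST_LIMITS) for c in changes]
by_bone_rc = [classify_response(c, BONE_RC_LIMITS) for c in changes]
```

In this made-up example the first two patients move from SD to PR and from SD to PD respectively, the same kind of shift reported for cohort-2.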

<FIGURE 3>

Discussion
The primary finding, albeit based on a small study of 9 patients with a total of 38 metastatic bone lesions, is that the test-retest variability of 18F-FDG uptake in bone is lower than has been previously published for soft-tissue [38,39,[44][45][46] or mixed tumors typical of breast cancer recurrence [35]. As summarized in the QIBA Profile summary paper [32], the within-subject coefficient of variation ranged from 10-12% in the above cited publications. In our study we estimated a within-subject coefficient of variation ($\mathrm{wCV}_{\Delta}$) for SUVmax of 6.6% (95% CI: 5.0-8.2%). There are two implications from this reduction in variability: first, that inclusion criteria can be relaxed compared to the EORTC, PERCIST, and QIBA proposals; second, that the thresholds for determining response can also be reduced. These comparisons are described in Table 3. Note that this only contains an excerpt of the detailed EORTC and PERCIST response criteria. In addition, the EORTC and PERCIST response criteria are intended to provide information on disease status, while the QIBA Profile Claims provide information about the statistical variability of SUVs under the assumption of no true biological change.

<Table 3>
Acronyms. EORTC: European Organization for Research and Treatment of Cancer. PERCIST: PET Response Criteria in Solid Tumors. QIBA: Quantitative Imaging Biomarkers Alliance. FDG: 18F-fluorodeoxyglucose. SUV: standardized uptake value. SUVmax: SUV calculated using the maximum value of a region placed over the image of an FDG-avid lesion. SULpeak: the mean SUV of a 1 cm diameter region centered over the maximum value of an FDG-avid lesion with biodistribution normalization by lean body mass.
As noted above, a small bias in the mean test-retest relative difference was observed for log-transformed SUVmax and SULpeak, where the corresponding 95% CIs did not include 0. However, this was thought to be due to sampling variability rather than a true bias between the two scans. To be conservative in the repeatability coefficient estimates, we recalculated the repeatability metrics without subtracting the sample mean, assuming the true bias was zero, which in effect includes the estimated bias as part of the variability and thus somewhat increases the variability estimates. This increased the estimated within-subject coefficient of variation ($\mathrm{wCV}_{\Delta}$) from 5.9% to 6.6%. Justification for assuming a mean relative difference of zero includes that patients were scanned on the identical scanner for test and retest scans and had similar injected doses, blood glucose concentrations and uptake times. Additionally, the soft tissue tumors for these same patients in cohort-1 did not show a bias in test-retest SUV metrics [35], which may be related to the small size but intense 18F-FDG uptake in bone metastases.
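Why the variability estimate rises when the sample mean is not subtracted follows from a standard algebraic identity, shown here as an aside. The symbols $\bar{\Delta}$ (sample mean of the log differences) and $n$ (number of lesions) are introduced only for this illustration.

```latex
\sum_{i=1}^{n} \Delta_i^{2}
  \;=\; \sum_{i=1}^{n} \left(\Delta_i - \bar{\Delta}\right)^{2}
  \;+\; n\,\bar{\Delta}^{2},
\qquad
\bar{\Delta} = \frac{1}{n}\sum_{i=1}^{n} \Delta_i
```

The first term on the right is the spread about the sample mean, and the second term is the contribution of the bias, so the sum of squares about zero can never be smaller than the sum of squares about the mean.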
We did not see a difference in reproducibility for metastatic bone lesions between types of primary breast cancer disease, such as lobular or ductal; however, the number of lesions studied was limited and most patients had ductal disease.
Measurement of FDG uptake followed a rigorous protocol that included frequent scanner calibration (3-4 times per year). Calibration may impose additional costs at institutions lacking onsite PET physics support. However, accurate measurement of SUV can potentially save resources, as it can provide an early response measurement that avoids the extra costs of unnecessary treatment and also limits unnecessary toxic exposure for patients.

Conclusions
18F-FDG-PET SUV values can be highly repeatable measures in breast cancer patients with bone metastases when acquired on a well-calibrated PET scanner with careful attention to scanner calibration, acquisition protocols and image analysis. This small cohort indicates that repeat SUV metrics of bone metastases can be measured with a within-patient COV ($\mathrm{wCV}_{\Delta}$) of less than 8%. In evaluating response in breast cancer patients with bone-dominant metastases, a decrease in 18F-FDG SUVmax of more than 17% would indicate response, while an increase of more than 20% would indicate disease progression; changes of this size are unlikely to be due to measurement variability. Multicenter clinical trials, such as ECOG-ACRIN EA1183 (FEATURE), benefit from confidence limits that allow interpretation of response.
FEATURE: ECOG-ACRIN EA1183 clinical trial, FDG PET to Assess Therapeutic Response in Patients with Bone-dominant Metastatic Breast Cancer.

Declarations
Human ethics. All methods were performed in accordance with the ethical standards as laid down in the Declaration of Helsinki and its later amendments or comparable ethical standards, as approved by our local (University of Washington) IRB (Institutional Review Board), Human Subjects and Radiation Safety committees.
Consent to participate. Patients in both study cohorts were recruited from the Seattle Cancer Care Alliance or University of Washington Medical Center (Seattle, WA), and signed informed consent prior to enrollment.
Consent to publish. The authors affirm that human research participants provided informed consent for publication of the images in Figures 1A and 1B. The PET imaging technicians and PET physicists, including Ms. Wanner, gave permission to acknowledge them in the Acknowledgements section.
Quantitative Data Availability. The datasets generated and/or analyzed during the current study are included along with the manuscript submission in the Supporting Materials section. Raw image data are available upon request.
Competing interests. We can confidently state that there is no conflict of interest, financial or otherwise, that may directly or indirectly influence the content of this manuscript.
Funding. Financial support was provided by the National Cancer Institute, a division of the National Institutes of Health (NIH/NCI), grants U01-CA148131 (Kinahan/Linden); R50-CA211270 (Muzi); R01-CA124573 (Mankoff/Specht); P30-CA015704 (Hippe).
Author contributions. All authors contributed to the concepts/study design, data acquisition or data analysis and interpretation and were guarantors of integrity of the entire study. The initial study concept, design and funding were provided by PEK, DAM and HML. The first draft of the manuscript was written by MM, but all authors read, edited and approved the final manuscript. Patient selection and management were performed by JMS, AN-J, JHL, DAM and HML. Image analysis and data collection were accomplished by MM, LMP and BFK. Statistical analysis and modeling of the data were completed by DSH, BFK and NO.

Supplementary Files
This is a list of supplementary files associated with this preprint. TestretestBoneMetsSupplementarymaterials20240102.docx

Figures

Figure 3

Table 2
Repeatability metrics for all n = 38 lesions.

Table 3
Comparison of EORTC and PERCIST response criteria, QIBA Profile Claims and current study.