In this study, we determined ‘harmonized’ image reconstruction hyperparameters for CRCmean, CRCmax and CRCpeak for the Siemens mMR and GE Signa PET/MRI systems. The experiments were performed in a controlled setting to focus on variability caused by scanner hardware design and image reconstruction settings. This work excludes measurement errors due to subjective manual definition of regions of interest. We systematically characterized CRC variability over a clinically relevant range of image reconstruction hyperparameters (iterative updates, Gaussian filtration) and algorithm implementations (3D OSEM, 3D OSEM plus resolution recovery) from each vendor. The imaging protocol was designed and executed to minimize variability in phantom preparation through a rigorous procedure for phantom filling, alignment and imaging.
The reconstructions were performed using an attenuation map template of the phantom. Since the effect of attenuation correction is decoupled from the choice of reconstruction hyperparameters, our work establishes image reconstruction hyperparameters across the two systems that will allow study of how the choice of attenuation correction strategy, as well as subtle implementation differences between the two vendors, affects the accuracy of lesion recovery coefficients. Once the image reconstruction hyperparameters are harmonized from the PET data, effects such as the choice of attenuation correction, positioning aids and others can be studied more accurately for a given scanner and across scanner vendors. Ultimately, complete harmonization of simultaneous PET/MR scanners will need to include attenuation correction as measured on the scanner.
This study aimed to identify harmonized image reconstruction hyperparameters for the two most widely used simultaneous PET/MRI scanners with a well-controlled multi-fill experiment, and it differs from a multi-site phantom study. Measured variability of quantitative performance between sites using the same make and model of scanner comes from two main sources. The first, and the largest, is variability of phantom fill. We minimized this variability through a rigorous filling procedure, identical at each site: we used long scans and identical fill activities, measured all activities in dose calibrators calibrated to a NIST-traceable 511 keV source, and used weights to assess phantom fill volumes.
The second source of error is associated with intrinsic differences in quantitative performance between studies performed on two physically different scanners of the same make and model. These differences, as manifested in the model-specific CRC performance curve on an appropriately calibrated and tuned scanner, are quite small. In fact, precise CRC performance using the NEMA IQ phantom (the same as used in our studies) is used by vendors as an acceptance criterion for scanner installations. This variability is small compared to other sources of error, most significantly fill accuracy and precision.
Remarkably similar quantitative performance was achieved through mutual tuning of reconstruction hyperparameters for both the 30-minute low-noise use case and the more clinically relevant 5-minute acquisitions. The clinical implication is that if patients are imaged under technically and biologically controlled conditions, but on different PET/MRI systems, prospectively used harmonized reconstruction parameter sets will result in nearly identical quantitative measurements independent of the system used. This conclusion is independent of lesion size. It has important consequences for multi-center clinical trials in which data will be aggregated from different models of PET/MR systems.
It should be noted that the ‘harmonized’ image reconstruction hyperparameters are not necessarily those that would yield the highest CRC values across all spheres. Indeed, we found that a harmonization approach relying solely on the lowest RMSD values leads to solutions with a high level of filtering, and therefore to very smooth images. Such a solution is detrimental to the imaging task of lesion detection, although it provides good agreement in CRC values. It emphasizes the effect of filtering rather than the performance of the scanner, which is appropriate for some clinical and clinical trial applications, but certainly not all. A solution with high CRC values (but still with acceptably low RMSD) would typically be obtained with a larger number of iterations and minimal filtering (as shown), but the images would be subject to higher noise levels. In particular, CRCmax reaches values significantly higher than 1.0 for image reconstruction with 3 mm post-reconstruction filters and resolution recovery. We have shown that harmonization solutions exist for either imaging task, lesion detection (high CRC) or quantitation accuracy across sites (minimal RMSD), for which excellent agreement in CRC values can be achieved between these two scanners.
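The trade-off between lowest RMSD and high CRC can be illustrated with a minimal sketch. It assumes RMSD is computed between two CRC curves over the six NEMA IQ spheres, and it adds a hypothetical minimum mean CRC constraint to steer the selection away from heavily filtered solutions. The CRC values, the setting labels and the `min_mean_crc` threshold are made up for illustration and are not data from this study.

```python
import math
from itertools import product

def rmsd(curve_a, curve_b):
    """Root-mean-square difference between two CRC curves of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(curve_a, curve_b)) / len(curve_a))

# HYPOTHETICAL CRCmean values for six spheres (smallest to largest).
# These numbers are illustrative only, not measurements.
scanner_a = {
    "light_filter": [0.45, 0.62, 0.75, 0.84, 0.90, 0.94],
    "heavy_filter": [0.25, 0.40, 0.55, 0.68, 0.78, 0.85],
}
scanner_b = {
    "light_filter": [0.47, 0.60, 0.76, 0.83, 0.91, 0.93],
    "heavy_filter": [0.25, 0.41, 0.55, 0.67, 0.78, 0.85],
}

def best_pair(curves_a, curves_b, min_mean_crc=0.0):
    """Return (rmsd, setting_a, setting_b) with the lowest RMSD among
    pairs whose pooled mean CRC is at least min_mean_crc."""
    candidates = []
    for (name_a, crc_a), (name_b, crc_b) in product(curves_a.items(), curves_b.items()):
        mean_crc = (sum(crc_a) + sum(crc_b)) / (len(crc_a) + len(crc_b))
        if mean_crc >= min_mean_crc:
            candidates.append((rmsd(crc_a, crc_b), name_a, name_b))
    return min(candidates)

# Lowest RMSD alone versus lowest RMSD subject to a CRC floor.
print(best_pair(scanner_a, scanner_b))
print(best_pair(scanner_a, scanner_b, min_mean_crc=0.7))
```

With these illustrative numbers, the unconstrained search selects the heavily filtered pair, while the constrained search selects the lightly filtered pair at a somewhat higher RMSD. This mirrors the choice between quantitation agreement and lesion detectability described above.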
The phantom was prepared under conditions mimicking those typically encountered in clinical practice with 18F-FDG in PET/CT. Imaging protocols suggest imaging at 60 minutes post-injection of 370-740 MBq (10-20 mCi) 18F-FDG from head to mid-thigh in a series of slightly overlapping bed positions, each with an axial field of view of 20-25 cm. Although substantial variability exists in clinical PET/CT, typical acquisition times are of the order of 2-4 minutes per bed position. Assuming uniform distribution in an average-sized human, a typical injection therefore yields approximately 5000 Bq/mL (e.g. for a 555 MBq (15 mCi) injection administered to a 75 kg patient and imaged 60 minutes post-injection). In this work, the phantom was prepared with a nominal background activity concentration of ~1600-1800 Bq/mL; thus the 5-minute scan would yield count statistics similar to a clinical 18F-FDG acquisition of 2 minutes per bed position, with the 30-minute study resulting in 6 times as many counts as a typical clinical study. The 30-minute acquisition data are used to determine the optimal harmonized hyperparameters in images with minimal noise, and thus to identify image reconstruction hyperparameters that yield the most comparable CRC values free from limitations due to statistical noise.
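The approximate 5000 Bq/mL figure can be checked with a short calculation. This is a sketch under simplifying assumptions that go beyond the text: uniform distribution, no excretion, a tissue density of 1 g/mL, and the 18F physical half-life of 109.77 minutes. The concentration-time product used at the end is only a crude proxy for relative counts; it ignores decay during the scan, scanner sensitivity and attenuation.

```python
F18_HALF_LIFE_MIN = 109.77  # physical half-life of fluorine-18, in minutes

def activity_concentration_bq_per_ml(injected_mbq, uptake_min, body_mass_kg,
                                     density_g_per_ml=1.0):
    """Mean activity concentration at imaging time.

    Assumes uniform distribution in the body and no biological clearance,
    so only physical decay over the uptake period is applied.
    """
    remaining_bq = injected_mbq * 1e6 * 0.5 ** (uptake_min / F18_HALF_LIFE_MIN)
    volume_ml = body_mass_kg * 1000.0 / density_g_per_ml
    return remaining_bq / volume_ml

# 555 MBq injected, imaged at 60 min post-injection, 75 kg patient.
patient_conc = activity_concentration_bq_per_ml(555, 60, 75)
print(f"patient concentration: {patient_conc:.0f} Bq/mL")

# Crude count proxy: concentration x acquisition time.
phantom_conc = 1700.0  # mid-point of the nominal 1600-1800 Bq/mL range
ratio = (phantom_conc * 5.0) / (patient_conc * 2.0)
print(f"phantom 5 min vs clinical 2 min (proxy ratio): {ratio:.2f}")
```

Under these assumptions the patient concentration comes out at roughly 5.1 kBq/mL, and the 5-minute phantom scan lands within about 20% of a 2-minute clinical bed position on this proxy.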
The average activity concentration at imaging time was approximately 8% lower in the experiments performed on the GE Signa. However, this scanner benefits from a higher sensitivity (21 vs 15 cps/kBq) relative to the Siemens mMR, and thus, when accounting for the relative scanner sensitivity, more counts were acquired on the GE Signa (~12% more). In addition, the GE Signa employs time-of-flight (TOF), while the Siemens mMR does not. The main advantages of TOF are faster convergence and a higher signal-to-noise ratio. This may explain, at least in part, why the best matching CRC curves are obtained with 2 iterations/16 subsets on the GE Signa scanner as opposed to 4 iterations/21 subsets on the Siemens mMR.
CRCmean and CRCpeak appear to be more robust than CRCmax (and thus SUVmax) as the basis for harmonization when comparing quantitative results from PET/MRI scanners, as is expected. This is likely an effect of statistical noise, present even in the 30-minute data sets and greater for the 5-minute acquisitions. The image noise (image roughness) depends on a variety of factors in the image reconstruction chain, including the number of iterations, the post-reconstruction filter, the choice of algorithm, the use of TOF and especially the use of resolution recovery. As such, image noise cannot be rigorously compared. However, the reconstructed noise was determined from the image roughness in the reconstructed images, and for these matched experiments (identical fill and imaging under similar conditions) comparable signal and noise were achieved between the two cameras. In phantom studies, SUVmean is highly robust since the lesion volume is known and the activity distribution within the lesion is uniform. This is not the case in patient studies, where extreme variability is observed in segmentation volume, making SUVmean of little clinical use currently. Lesion SUVmean is therefore not recommended within the context of clinical trial response assessment. SUVmax is most typically used. Inter-reader measurement variability of SUVmax is small, and it is a robust measurement, although it is impacted significantly by image noise. SUVpeak has slowly been gaining acceptance as a more robust (less sensitive to noise) metric of response, although literature support for its use is currently less prevalent. Similarly, our data indicate that SUVpeak is likely the most repeatable measure among the three and that SUVmax is more affected by noise. SUVpeak will generate higher SUV values than SUVmean; however, it can only be defined for lesions larger than 1 cm.
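The different noise sensitivity of the three metrics can be illustrated with a toy simulation. This sketch is not the analysis used in this study. It assumes a uniform region of true value 1.0 with uncorrelated Gaussian voxel noise, and it approximates the peak metric by the highest mean over any 3x3x3 voxel neighbourhood rather than a 1 cm3 sphere. Real reconstructed noise is spatially correlated, so the numbers are only qualitative.

```python
import random
import statistics

def simulate_metrics(noise_sd, n=9, seed=0):
    """Return (mean, peak, max) for a uniform n x n x n region of true value 1.0
    with additive white Gaussian noise of standard deviation noise_sd."""
    rng = random.Random(seed)
    vol = [[[1.0 + rng.gauss(0.0, noise_sd) for _ in range(n)]
            for _ in range(n)] for _ in range(n)]
    flat = [v for plane in vol for row in plane for v in row]
    offsets = (-1, 0, 1)
    # Peak: highest 3x3x3 neighbourhood mean with the cube fully inside the region.
    peak = max(
        statistics.fmean(vol[i + di][j + dj][k + dk]
                         for di in offsets for dj in offsets for dk in offsets)
        for i in range(1, n - 1) for j in range(1, n - 1) for k in range(1, n - 1)
    )
    return statistics.fmean(flat), peak, max(flat)

# Compare a low-noise and a high-noise case.
for sd in (0.05, 0.20):
    v_mean, v_peak, v_max = simulate_metrics(sd)
    print(f"noise sd {sd:.2f}: mean={v_mean:.3f} peak={v_peak:.3f} max={v_max:.3f}")
```

In this toy model the maximum sits well above the true value and moves further from it as noise increases, the mean stays close to 1.0, and the peak lies in between. This is consistent with CRCmax exceeding 1.0 and with SUVpeak giving higher values than SUVmean.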
In studies where quantitative harmonization is critical to the trial’s response assessment, tighter harmonization appears to be achievable when using the SUVpeak metric. The SUVpeak metric seems less dependent on the choice of region of interest and the level of smoothing, making it more amenable to harmonization.
A limitation of this study is that the experiments were performed in phantoms at a set count density in the spheres and background. The results are not directly translatable to human imaging, as OSEM algorithms are not linear and their performance depends on the specific patient activity distribution and count density. This limitation is common to all studies in phantoms. An alternative approach could be to insert synthetic lesions of varied activity (SUV), size and shape into clinical patient datasets. This is currently an active area of research that we are investigating.
Our data indicate that harmonized image reconstruction hyperparameters (both for the ideal long acquisition and under clinical conditions) can be achieved for PET/MR scanners, and that comparable size-dependent recovery coefficients, or size-dependent tumor SUV values, can be obtained well within the limits proposed by EARL-EANM. PET data acquired on PET/MR scanners would thus be acceptable for inclusion in multi-center clinical trials, at least as defined by the EARL-EANM criteria. However, this study goes beyond EARL-EANM in that it determines image reconstruction hyperparameters that provide practically identical CRC curves between these two scanners, thereby showing that variability in small-lesion quantitation can be largely eliminated by controlling the image reconstruction hyperparameters. This conclusion is important as it will allow further study of other factors affecting quantitative PET in PET/MRI, such as the specific choice of attenuation correction technique (including the level at which bones are included), patient positioning aids and others.