Harmonization of PET image Reconstruction HyperParameters in Simultaneous PET/MRI


Abstract
Objective: Simultaneous PET/MRIs vary in their quantitative PET performance due to inherent differences in the physical systems and differences in the image reconstruction implementation. This variability in quantitative accuracy confounds the ability to meaningfully combine and compare data across scanners.
In this work, we define image reconstruction hyperparameters that lead to comparable contrast recovery curves across simultaneous PET/MRI systems.
Method: The NEMA NU-2 image quality phantom was imaged on one GE Signa and on one Siemens mMR PET/MRI scanner. The phantom was imaged at 9.8:1 contrast with standard spheres (diameter: 10, 13, 17, 22, 28, 37 mm) and with custom spheres (diameter: 8.5, 11.5, 15, 25, 32.5, 44 mm) using a standardized methodology. Analysis was performed on a 30 minute listmode data acquisition and on 6 realizations of 5 minutes from the listmode data. Images were reconstructed with the manufacturer-provided iterative image reconstruction algorithms with and without point spread function (PSF) modeling. For both scanners, a post-reconstruction Gaussian filter of 3 to 7 mm in steps of 1 mm was applied. Attenuation correction was provided from a scaled computed tomography (CT) image of the phantom registered to the MR-based attenuation images and verified to align with the non-attenuation-corrected PET images. For each of these image reconstruction parameter sets, contrast recovery coefficients (CRCs) were determined for the SUVmean, SUVmax and SUVpeak of each sphere. A hybrid metric combining the root mean squared discrepancy (RMSD) and the absolute CRC values was used to optimize for the best match in CRC between the two scanners while simultaneously weighting towards higher-resolution reconstructions. The hyperparameter sets that optimized this metric were identified as the best candidate reconstructions for each vendor for harmonized PET image reconstruction.
Results: The range of clinically relevant image reconstruction hyperparameters demonstrated widely different quantitative performance across cameras. The best match of CRC curves was obtained at the lowest RMSD values with: for CRCmean, 2 iterations and a 7 mm filter with PSF on the GE Signa and 4 iterations and a 6 mm filter on the Siemens mMR; for CRCmax, 2 iterations and a 7 mm filter on the GE Signa and 4 iterations and a 6 mm filter on the Siemens mMR; and for CRCpeak, 4 iterations and a 7 mm filter with PSF on the GE Signa and 3 iterations and a 6 mm filter on the Siemens mMR. Over all reconstructions, the RMSD between CRCs was 2.4%, 3.1% and 2.3% for CRCmean, CRCmax and CRCpeak, respectively. The solution of 2 iterations and a 3 mm filter on the GE Signa and 4 iterations and a 3 mm filter on the Siemens mMR, both with PSF, led to simultaneous harmonization with high CRC and low RMSD for CRCmean, CRCmax and CRCpeak, with RMSD values of 3.4%, 5.5% and 3.0%, respectively.
Conclusions: For two commercially available PET/MRI scanners, user-selectable hyperparameters that control iterative updates, image smoothing and PSF modeling provide a range of contrast recovery curves that allows harmonization under either of two strategies: optimal match in CRC or high CRC values.
This work demonstrates that nearly identical CRC curves can be obtained on different commercially available scanners by selecting appropriate image reconstruction hyperparameters.

Full Text
[7] in the context of PET/CT. These efforts aimed at proposing specifications and requirements in the patient preparation, injection and imaging in order to provide comparability and consistency for quantitative FDG-PET across scanners in oncology.

The aim of harmonization in PET differs from that of standardization. Standardization implies that sites use a uniform procedure with the goal of minimizing variations, while harmonization aims at achieving comparable results across manufacturers or sites even though slightly different procedures are used. Both harmonization and standardization reduce variations, and harmonization encompasses standardization, which is stricter; in this work we chose to concentrate on harmonization strategies. Harmonization aims at achieving the same level of accuracy and thus minimizing variations, or at determining limits of tolerable variation. Most commonly in PET/CT, this harmonization has been performed through the use of carefully tuned, scanner model-specific post-reconstruction filtration [8]. Image generation for PET from PET/MRI is functionally the same as for PET/CT. The main differences are the scanner geometry, which typically has a smaller ring diameter and a longer axial field of view than PET/CT, and the use of MR-based attenuation correction techniques.
Currently, two manufacturers (Siemens and General Electric) offer PET/MRI scanners in the US, although a third vendor (United Imaging) is poised to enter the US market. The two currently available PET/MRI scanners differ in the choice of scintillation crystal (size and material), overall detector geometry, photodetector technology (avalanche photodiode vs silicon photomultiplier array) and image reconstruction algorithms. Although both vendors provide an ordered-subset expectation-maximization (OSEM) algorithm, its implementation varies between vendors, and its performance is expected to differ due to the use of different numbers of subsets and iterations, data compression prior to reconstruction, implementation of the system matrix, crystal size and axial plane thickness, as well as different implementations of point spread function modeling. The aim of this study was to determine sets of mutable and clinically relevant image reconstruction hyperparameters on both systems that yield the optimal match in quantitative performance as a function of object size for tumor-like objects. The study was performed on one scanner from each vendor with a controlled phantom experiment to minimize errors in phantom filling and to isolate the effects of scanner hardware and image reconstruction algorithms on measurement accuracy.

Methods
PET imaging was performed using the Siemens Biograph mMR (Siemens Healthineers, Erlangen, Germany) and the General Electric Signa PET/MRI (General Electric, Wisconsin, USA). Scanner characteristics are compared in Table 1 and their performance evaluations were previously reported [9, 10]. Phantom imaging was performed using the NEMA IEC phantom [11] (Data Spectrum Corp., North Carolina, USA) with the standard set of spheres (diameters: 10, 13, 17, 22, 28 and 37 mm) and a custom set of spheres (diameters: 8.5, 11, 15, 25, 32.5 and 44 mm). The second set of spheres contains intermediate sphere sizes and provides more data points for measurement and optimization. A standardized filling procedure was implemented to achieve a 9.8:1 ratio between the spheres and the water background. The procedure followed the EANM FDG PET/CT guidelines [12]. Two doses of approximately 20 MBq of 18F were used. The first dose was diluted in 1000 mL of water and the second dose was diluted into the approximately 9700 mL background water volume of the phantom chamber. The water volume of the phantom was determined by weight. The phantom was then centered in the PET field of view, ensuring that all six spheres were in the same imaging plane. On the Siemens scanner, this was ensured by placing the phantom on a foam cradle. PET data acquisitions were performed in listmode for 30 minutes. A two-point DIXON [13] (LavaFlex on the GE Signa) attenuation scan was acquired but was only used to register a phantom CT attenuation template as described below.
Images were reconstructed using the vendor-provided image reconstruction software with a range of hyperparameters encountered in the clinic for oncologic whole-body PET. Data from the GE Signa scanner were reconstructed offline with GE's Duetto PET reconstruction toolbox (v02.06) using the 3D-OSEM (ordered-subset expectation-maximization) [14] algorithm with time of flight (TOF), at 2 and 4 iterations, 16 subsets, with and without point spread function (PSF) resolution modeling. On the Siemens mMR, images were reconstructed with e7tools (VE11P-SP2) using 3D-OSEM, 1 to 4 iterations, 21 subsets, with and without PSF resolution modeling. On each system, the number of subsets is fixed by the manufacturer for clinical operation, at 16 and 21, respectively. The number of iterations on the GE Signa was selected from clinical protocols for oncological PET. On the Siemens system, the number of iterations was selected to encompass the same range of image updates (defined as the product of the number of iterations and the number of subsets). As such, GE Signa images were reconstructed at 32 and 64 image updates, and the images on the Siemens mMR were reconstructed with a range of 21 to 84 updates.
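The image-update counts quoted above follow directly from the product of iterations and subsets. A minimal sketch of that arithmetic, using only the settings stated in the text (the function name is illustrative):

```python
# Image updates = number of iterations x number of subsets.
# Iteration and subset values are those stated in the text above.
def image_updates(iterations, subsets):
    return [n * subsets for n in iterations]

ge_updates = image_updates(iterations=[2, 4], subsets=16)
mmr_updates = image_updates(iterations=[1, 2, 3, 4], subsets=21)

print(ge_updates)   # [32, 64]
print(mmr_updates)  # [21, 42, 63, 84]
```

The Siemens range of 21 to 84 updates brackets both GE settings, which is why 1 to 4 iterations were reconstructed on the mMR.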
TOF was not turned off on the GE Signa, as non-TOF PET reconstruction is not typically performed clinically on systems that allow TOF. The benefits of TOF are well documented: it reduces image variance, thereby improving signal to noise, improves the convergence rate and reduces artifacts [15, 16], and it is typically used whenever available. The Siemens mMR does not offer a TOF option. All reconstructions were repeated with a post-reconstruction Gaussian filter ranging from 3 to 7 mm. Images were reconstructed on a common voxel size of 2.34 mm in the transverse direction on a 256 matrix. GE Signa images were reconstructed on the native 2.78 mm slice thickness with standard axial filtering, while the Siemens mMR images were reconstructed on the native 2.027 mm slice thickness. Measured attenuation of the phantom by MRI methods such as DIXON or LavaFlex leads to inaccurate attenuation maps, since the phantom materials (water and the plastic phantom wall) do not accurately mimic human tissue; water-like material appears distorted, while plastics do not show at all [17]. The phantom also contained a 50 mm diameter cylindrical plastic insert filled with polystyrene and water to mimic the lungs. Consequently, standard tissue segmentation algorithms fail when applied to the reconstructed image of the phantom. To avoid these issues, attenuation correction of the NEMA phantom is provided by the manufacturers using a template stored in the system. The CT-based template phantom attenuation map was used on the GE Signa and registered to the TOF-NAC PET images via rigid registration. For the mMR reconstructions, attenuation correction was taken from a scaled CT image [18, 19] of the phantom aligned to the MR DIXON attenuation map, and included attenuation of the patient bed as a 'hardware' attenuation map.
The CT images of the phantom were aligned to the MR images by rigid registration (Euler transform) using the Elastix software [20, 21] with a three-level multi-resolution scheme, B-spline interpolation, the advanced mutual information metric and the adaptive stochastic gradient descent optimizer. Accuracy of the registration of the template attenuation map was visually inspected and further verified by inspection of the resulting mu-map over the non-attenuation-corrected PET images.
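To give a sense of scale for the 3 to 7 mm Gaussian post-reconstruction filters relative to the 2.34 mm transverse voxel size given earlier in this section, the sketch below converts a filter width to a Gaussian standard deviation. It assumes the quoted widths are full widths at half maximum (FWHM), which is the usual convention but is not stated explicitly in the text; the function name is illustrative.

```python
import math

VOXEL_MM = 2.34  # transverse voxel size stated in the text

def fwhm_to_sigma(fwhm_mm):
    """Convert a Gaussian FWHM to its standard deviation (same units)."""
    return fwhm_mm / (2.0 * math.sqrt(2.0 * math.log(2.0)))

# Assumption: the 3-7 mm filter widths are FWHM values.
for fwhm in range(3, 8):
    sigma_mm = fwhm_to_sigma(fwhm)
    print(f"{fwhm} mm FWHM: sigma = {sigma_mm:.2f} mm "
          f"({sigma_mm / VOXEL_MM:.2f} transverse voxels)")
```

Under this assumption, the narrowest filter has a standard deviation well below one transverse voxel, so it smooths the image only mildly.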
In a first analysis, images were reconstructed using the entire 30 minute listmode data acquisition. In a second analysis, the 30 minute listmode data were divided into 6 frames of approximately 5 minutes (277, 286, 295, 304, 314 and 324 seconds). The increasing frame duration ensured an approximately equal number of collected events in each realization when accounting for radioactive decay. Contrast recovery coefficients (CRCs), as defined by Liow and Strother [22], were calculated using the scanner-measured sphere activity concentration, the scanner-measured background water activity concentration, and the sphere and background water activity concentrations calculated from the assayed activities and dilution data, decay-corrected to scan time.
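The increasing frame durations can be reproduced approximately by splitting the acquisition so that each frame contains the same decay-weighted share of events. The following is a minimal sketch of that idea, not the authors' procedure; it assumes pure 18F decay with a half-life of 109.77 minutes and ignores dead time and randoms. The function name is hypothetical.

```python
import math

F18_HALF_LIFE_S = 109.77 * 60.0  # assumed physical half-life of 18F, in seconds

def equal_count_frames(total_s, n_frames, half_life_s=F18_HALF_LIFE_S):
    """Frame durations that each hold an equal share of decay-weighted events."""
    lam = math.log(2.0) / half_life_s
    # Fraction of the initial activity that decays during the whole scan.
    total_fraction = 1.0 - math.exp(-lam * total_s)
    edges = [0.0]
    for k in range(1, n_frames + 1):
        remaining = 1.0 - k * total_fraction / n_frames
        edges.append(-math.log(remaining) / lam)
    return [end - start for start, end in zip(edges, edges[1:])]

durations = equal_count_frames(total_s=1800.0, n_frames=6)
print([round(d, 1) for d in durations])
```

Under these assumptions, each computed duration falls within about one second of the corresponding frame length quoted above, and the six frames add up to the full 1800 s acquisition.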
All images were resampled to provide cubic voxels of approximately 1 mm3. Contrast recovery coefficients were computed using three methodologies: from the average sphere activity in a spherical volume of interest (VOI) with a diameter equal to the physical inner sphere diameter (CRCmean), from the maximum value in each sphere (CRCmax), and from the peak value in each sphere, defined as the average over the 1 cm3 VOI with the highest value within the physical sphere (CRCpeak). The definition of SUVpeak was applied to define CRCpeak [23]. As defined, CRCpeak may or may not include the hottest pixel within the sphere. Results of the two sets of spheres were combined to provide CRC curves against the 12 sphere sizes ranging from 8.5 to 44 mm. The background water activity was measured as the average over two 50 mm diameter circular regions of interest (ROIs) placed in five adjacent image planes of the phantom without spheres. These ROIs were also used to define the image roughness, which is a measure of the apparent noise [24]. The average coefficient of variation (COV) over those ten regions defines the image roughness (IR) and is calculated by

$$\mathrm{IR} = \frac{1}{10}\sum_{k=1}^{10}\frac{\mathrm{STD}_k}{\mathrm{Mean}_k}$$

where $\mathrm{STD}_k$ is the standard deviation of the pixel intensity and $\mathrm{Mean}_k$ is the average pixel intensity in region $k$. The root mean squared discrepancy (RMSD) was then calculated for all 800 pairs of image reconstruction hyperparameter combinations:

$$\mathrm{RMSD}_{i,j} = \sqrt{\frac{1}{12}\sum_{s=1}^{12}\left(\mathrm{CRC}_{i,s}-\mathrm{CRC}_{j,s}\right)^{2}}$$

where $i$ and $j$ index the image reconstructions from the GE Signa and the Siemens mMR, respectively, and the summation extends over all 12 spheres $s$. The optimized PET image reconstruction hyperparameter set (number of iterations, filter width, and use of PSF) was determined by selecting the parameters that minimized a hybrid metric (Equation 4) that combines the RMSD with the sum CRC product of the two reconstructions through a weighting parameter b. The analysis was performed first using the entire 30 minutes of data and, in a second step, by averaging the results of the six independent realizations of approximately 5 minutes. Five minutes of listmode data more closely simulate the statistics of a clinical oncology FDG scan. CRC agreement curves were generated for mean, max and peak CRC.
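The metrics described in this section can be sketched in a few lines. This is an illustrative outline rather than the authors' code. The CRC function assumes the common form in which measured contrast is divided by true contrast, and the definition in [22] may differ in detail. The image roughness uses the population standard deviation, and the RMSD averages over the spheres before taking the square root; both are assumptions. All function and variable names are hypothetical. The last lines count the hyperparameter combinations described in the text.

```python
import itertools
import math
import statistics

def crc(sphere_meas, bkg_meas, sphere_true, bkg_true):
    """Contrast recovery: measured contrast over true contrast (assumed form)."""
    return (sphere_meas / bkg_meas - 1.0) / (sphere_true / bkg_true - 1.0)

def image_roughness(regions):
    """Mean coefficient of variation over background regions (pixel-value lists)."""
    covs = [statistics.pstdev(r) / statistics.fmean(r) for r in regions]
    return sum(covs) / len(covs)

def rmsd(crc_a, crc_b):
    """Root mean squared discrepancy between two CRC curves over the same spheres."""
    if len(crc_a) != len(crc_b):
        raise ValueError("CRC curves must cover the same spheres")
    squared = [(a - b) ** 2 for a, b in zip(crc_a, crc_b)]
    return math.sqrt(sum(squared) / len(squared))

# Hyperparameter grid from the text: (iterations, filter width in mm, PSF on/off).
filters_mm = [3, 4, 5, 6, 7]
psf_options = [False, True]
ge_settings = list(itertools.product([2, 4], filters_mm, psf_options))
mmr_settings = list(itertools.product([1, 2, 3, 4], filters_mm, psf_options))
pairs = list(itertools.product(ge_settings, mmr_settings))

print(len(ge_settings), len(mmr_settings), len(pairs))  # 20 40 800
```

Each entry of `pairs` would be scored by calling `rmsd` on the two 12-point CRC curves. Dropping the 1-iteration mMR reconstructions leaves 20 x 30 = 600 pairs.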

Results
On the mMR, the phantom preparation resulted in average activity concentrations of 1767 Bq/mL ± 5.0% and 17053 Bq/mL ± 6.0% for the background volume and the spheres, respectively, at imaging times. These activity concentrations correspond to an average ratio of 9.66 ± 0.10. For the GE Signa, these values were 1622 Bq/mL ± 3.4% and 15526 Bq/mL ± 5.1%, corresponding to an average ratio of 9.57 ± 0.22. The average water volume of the phantom as determined by weight was 9737 ± 11 mL. Sorting the listmode data into six realizations of approximately 5 minutes resulted in an average of 47.3 ± 0.18 million trues per realization for the GE Signa and 42.0 ± 0.08 million trues per realization for the mMR.
The range of clinically relevant image reconstruction hyperparameters employed demonstrated widely different quantitative performance across the two manufacturers with regard to recovery of activity measurement as a function of object size. By varying both the number of iterations and the post-reconstruction filter level, bands of CRC curves were obtained that showed significant overlap between the two PET/MRI scanners. Contrast recovery coefficient curves for both systems using the full 30 minutes of acquisition data, four iterations and post-reconstruction filters of 3 and 7 mm are shown in Figure 1 (only these two filters are presented for clarity). CRC curves are presented for mean, max and peak, hereafter referred to as CRCmean, CRCmax and CRCpeak. These plots depict the range of CRC values, and thus size-dependent SUV values, that are obtained in the clinical setting by varying the post-reconstruction filter and using the resolution recovery algorithm option. In this figure, the effect of increasing post-reconstruction filtration (from 3 to 7 mm) is illustrated; as expected, less filtration consistently led to higher CRC values. The largest effect of the resolution recovery algorithm is observed on CRCmax, with smaller consequences for CRCmean and CRCpeak. Of note, the CRCpeak for spheres with a diameter of less than 13 mm would typically not be defined, as these spheres have a volume of less than 1 cm3. The CRCpeak values for these smaller spheres are thus lower than their CRCmean values, as the peak VOI includes surrounding background activity. They are nevertheless included for comparison and completeness of the present study.
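The 13 mm threshold for CRCpeak can be checked with a short volume calculation. The sketch below assumes a spherical 1 cm3 peak VOI, which the text does not specify, and the names are illustrative.

```python
import math

def sphere_volume_cm3(diameter_mm):
    """Volume of a sphere in cm^3 from its diameter in mm."""
    radius_cm = diameter_mm / 20.0
    return 4.0 / 3.0 * math.pi * radius_cm ** 3

# Diameter of a sphere that holds exactly 1 cm^3 (assumed peak VOI shape).
peak_voi_diameter_mm = 10.0 * (6.0 / math.pi) ** (1.0 / 3.0)
print(f"1 cm^3 sphere diameter: {peak_voi_diameter_mm:.1f} mm")

for diameter in [8.5, 10, 13, 17]:
    volume = sphere_volume_cm3(diameter)
    fits = "yes" if volume >= 1.0 else "no"
    print(f"{diameter} mm sphere: {volume:.2f} cm^3, holds a 1 cm^3 VOI: {fits}")
```

A 1 cm3 sphere is about 12.4 mm across, so the 13 mm sphere is the smallest standard sphere that can contain the peak VOI. For smaller spheres the peak VOI necessarily extends into the background, which lowers CRCpeak.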
Plots of the image roughness vs CRCpeak are presented in Figure 2 for each scanner for the 30 min and 5 min scans and for the NEMA phantom with the standard set of spheres. Only the 3 mm filtration and CRCpeak are presented for brevity; plots of CRCmean and CRCmax are similar (not shown). The filter is applied post-reconstruction and only decreases the values of CRCpeak and image roughness (reduces noise). On the mMR, as the number of iterations increases from 1 to 4, the CRCpeak values increase and reach a maximum value while the COV (image roughness) increases. This is especially apparent for the smallest spheres. For the largest spheres, the CRCpeak value is constant for all iteration numbers, consistent with the fact that iterative reconstruction converges faster for larger objects. The PSF reconstructions, represented by the dashed lines, show higher CRC and lower image roughness (noise) at the same number of iterations. This is consistent with PSF modeling, which improves resolution and decreases image noise. For the largest spheres, 'convergence' is approached as early as 2 iterations, but for spheres of 17 mm and smaller, the CRCpeak values tend to increase slowly, indicating that more iterations are required for convergence. The image roughness values on the GE Signa datasets are of the same magnitude as for the Siemens mMR, indicating comparable signal-to-noise characteristics. Slightly lower image roughness is observed on the Signa, possibly due to the use of TOF. The CRCpeak values at 2 and 4 iterations are approximately identical, indicating close to convergence even for the smallest spheres. These observations are consistent with the use of TOF, which is known to increase the convergence rate and lower noise. The plots of image roughness vs CRCpeak for the 5 minute reconstructions are presented in the bottom row.
The error bars are calculated from the standard deviation over the six noise realizations and correspond to the ensemble noise. The same observations can be made as for the 30 min datasets, showing that the signal and image noise are highly comparable between the experiments performed on each scanner and that convergence is approached similarly. Figure 3 presents the distribution of the RMSD vs the sum CRC product for all 600 image reconstruction pairs, plotted for CRCmean, CRCmax and CRCpeak (top row). The reconstructions on the mMR with 1 iteration were omitted for clarity, as Figure 2 showed that they had not converged. The bottom row presents the same plots with an expanded scale on RMSD and with labels for specific reconstruction sets. On these graphs, each reconstruction pair is represented by a box whose color depicts the overall amount of filtration, ranging from dark blue (minimal filtration, 3 mm on both mMR and Signa) to red (maximal filtration, 7 mm on both mMR and Signa), with a gradual progression from blue to red for intermediate levels of filtration. In this respect, points on the right indicate reconstructions with low filtration and low agreement (high RMSD), points on the bottom left indicate solutions with high agreement (low RMSD) of CRC curves but low resolution (low sum CRC), and solutions on the upper left indicate the highest resolution while retaining a high degree of agreement according to RMSD.
Solutions on the left are all achieved with low RMSD (good agreement of CRC curves) and varied levels of resolution. This indicates that multiple optimal solutions exist depending on the imaging task. Setting b to 0 leads to the solutions with the lowest RMSD. The solutions with the lowest numerical values of RMSD are obtained at the highest filtration of 6 and 7 mm; however, these also correspond to solutions with a low sum CRC product. Indeed, optimization by RMSD alone will tend to select solutions with low resolution and reflect the influence of the post-reconstruction filter more than the scanner performance. With the parameter b set to 1.0, the solution that minimizes Equation 4 is represented by the green symbol. These plots show that a common optimized image reconstruction pair exists at 4 iterations with PSF and a 3 mm filter for the mMR and 2 iterations with PSF and a 3 mm filter on the Signa. This solution corresponds to a high CRC product and an acceptable RMSD value, and shows that a harmonization solution exists in which both high resolution and excellent agreement in CRC curves can be achieved. The RMSD values using CRCpeak are also, on average, smaller than the RMSD values of CRCmean and CRCmax. Figure 4 presents the CRC curves for the reconstruction hyperparameter sets that provide the best match for CRCmean, CRCmax and CRCpeak using the full 30 minutes of acquired data, for the high-resolution solution (left), an intermediate solution (middle) and the lowest-RMSD solution (smoother images, right). The solutions with the highest sum CRC and low RMSD provide a simultaneous optimization of resolution and CRC agreement. The optimized solution for CRCmax shows the highest level of noise, and thus we provide an intermediate solution with slightly increased filtering as a compromise in which excellent agreement in CRC curves is achieved with less noise but still with a low RMSD. The corresponding image reconstruction hyperparameter pairs are reported in Table 2.
Excellent agreement between the scanners is observed for the three scenarios, with only subtle differences. At the best RMSD match, 100% recovery (CRC = 1.0) is generally obtained for CRCmax or CRCpeak for spheres larger than 20 mm in diameter, and generally for lower CRC values. The RMSD values for the three harmonization strategies are reported in Table 2 and Figure 4. It is interesting to note that lower RMSD values are observed for CRCpeak than for CRCmean and, finally, CRCmax, suggesting that closer 'harmonization' can be achieved using SUVpeak rather than SUVmax or SUVmean. In this figure, the green dashed lines represent the EANM suggested limits in CRCmax values for qualification of PET/CT, where the recovery coefficient (RC) values of [25, 26] were converted to CRC using a known ratio of 10. Figure 5 shows the harmonized CRC curves for mean, max and peak values using only 5 minutes of listmode data. Average values were derived from the six noise realizations. Error bars on these plots correspond to the ensemble noise on the CRC values over the 6 noise realizations. Five minutes of listmode data corresponds to a more clinically relevant scenario, as it more closely mimics the data acquisition and the level of statistics encountered in oncology FDG PET/CT scans. Table 3 contains the harmonized image reconstruction hyperparameters obtained using 5 minutes of listmode data for the three optimization criteria. As in Figure 4, excellent agreement between the two scanners is found, and similarly harmonized image reconstruction hyperparameters can be determined. The RMSD values are approximately equal to those of the 30 minute acquisition.

Effect of scan time on harmonized hyperparameters
CRC curves of images reconstructed using only 5 minutes of listmode data were compared, employing the image reconstruction hyperparameters that provided the best match from the 30 minute scans (data shown in Figure S1 in supplemental data). Very good agreement in CRCmean was observed for the 5 minute acquisitions using the 30 minute harmonization hyperparameters, as reflected in the CRC curves and in the value of RMSD. The best agreement for CRCmax was found when selecting the 30 minute harmonized reconstruction hyperparameters for the intermediate and best-RMSD optimizations, while the worst agreement was observed for the best CRCmax. The mean and peak CRC curves show generally better agreement than the CRCmax curves. The largest differences were observed for CRCmax with the 30 minute harmonized reconstruction hyperparameters optimized for high CRC values, which reflects the higher noise encountered in the 5 min data. However, in all cases, the overall RMSD values are acceptably small.

Discussion
In this study, we determined 'harmonized' image reconstruction hyperparameters for CRCmean, CRCmax and CRCpeak for the Siemens mMR and GE Signa PET/MRI systems. The experiments were performed in a controlled setting to focus on variability caused by scanner hardware design and image reconstruction settings. This work excludes measurement errors due to subjective manual definition of regions of interest. Clinically relevant image reconstruction hyperparameters (iterative updates, Gaussian filtration) and algorithm implementations (3D OSEM, 3D OSEM plus resolution recovery) from each vendor were systematically varied to characterize CRC variability. The imaging protocol was designed and executed to minimize the variability in the phantom preparation by using a rigorous phantom filling procedure, phantom alignment and imaging protocol.
The reconstructions were performed using an attenuation map template of the phantom. Since the effect of attenuation correction is decoupled from the choice of reconstruction hyperparameters, our work establishes image reconstruction hyperparameters across the two systems that will allow the study of the consequences in quantitation regarding the accuracy of the recovery coefficients of lesions due to the choice of attenuation correction strategy, as well as subtle implementation differences between the two vendors. Once the image reconstruction hyperparameters are harmonized from the PET data, effects such as the choice of attenuation correction, positioning aids and others can be more accurately studied for a given scanner and across scanner vendors. Ultimately, complete harmonization of simultaneous PET/MR scanners will need to include an attenuation correction as measured from the scanner.
This study aimed at identifying harmonized image reconstruction hyperparameters for the two most widely used simultaneous PET/MRI scanners with a multi-fill, well-controlled experiment, and differs from a multi-site phantom study. Measured variability of quantitative performance between sites using the same make and model of scanner comes from two main sources. The first, and largest, is variability of phantom fill. We minimized this variability by following a rigorous filling procedure, identical at each site: long scans and identical fill activities were used, all activities were measured in dose calibrators calibrated to a NIST-traceable 511 keV source, and weights were used to assess phantom fill volumes.
The second source of error is associated with fundamental intrinsic differences in quantitative performance between studies performed on two physically different scanners of the same make and model. These differences, as manifested in the scanner model-specific CRC performance curve on an appropriately calibrated and tuned scanner, are quite small. In fact, precise CRC performance using the NEMA IQ phantom (the same as used in our studies) is used by vendors as an acceptance criterion for scanner installations. This variability is small compared to other sources of error, most significantly fill accuracy and precision.
Remarkably similar quantitative performances were achieved through mutual tuning of reconstruction hyperparameters for both the 30 minute low-noise use-case, and the more clinically relevant 5 minute acquisitions. The clinical implication is that if patients are imaged under technically and biologically controlled conditions, but on different PET/MRI systems, prospectively-used harmonized reconstruction parameter sets will result in nearly identical quantitative measurements independent of the system used.
This conclusion is independent of lesion size. This aspect has important consequences to multi-center clinical trials where data will be aggregated from different models of PET/MR systems.
It should be noted that the 'harmonized' image reconstruction hyperparameters are not necessarily those that would yield to the highest CRC values across all spheres. Indeed, we have identi ed that a harmonization approach relying solely on the lowest RMSD values leads to solutions of high-level of ltering and therefore will correspond to very smooth images. This solution will be detrimental for imaging task of lesion detection albeit providing the good agreement in CRC values. This solution emphasize more the effect of ltering rather than the performance of the scanner, which is appropriate for some clinical and clinical trial applications, but certainly not all. A solution with high CRC values (but still with acceptably low RMSD) would be typically obtained at larger number of iterations and minimal ltering (as shown) but images would be subject to higher noise levels. In particular, the CRC max reaches values signi cantly higher than 1.0 for image reconstruction with 3mm post-reconstruction lters and using resolution recovery. We have shown that harmonization solutions exist for which, depending on the imaging task, being either lesion detection (high CRC) or quantitation accuracy across sites minimizing RMSD, excellent agreement in CRC values can be achieved between these two scanners.
The phantom was prepared under conditions mimicking conditions typically encountered in clinical practice with 18 F-FDG in PET/CT. Imaging protocols suggests imaging at 60 minutes post-injection of 370-740 MBq (10-20 mCi) 18 F-FDG from head to mid-thigh in a series of slightly overlapping bed positions, each with axial eld of view of 20-25 cm. Although, substantial variability exists in clinical PET/CT, typical acquisition times are of the order of 2 -4 minutes per bed position. Therefore, assuming uniform distribution in an average sized human, a typical injection yields to approximately 5000 Bq/mL (e.g. for 555 MBq (15mCi) injection administer and imaged 60 minute post-injection in a 75 kg patient).
In this work, the phantom was prepared with a nominal background activity concentration of ~1600-1800 Bq/mL, and thus the 5-minute scan would yield count statistics similar to those of a clinical 18 F-FDG acquisition of 2 minutes per bed position, with the 30-minute study resulting in 6 times the counts of a typical clinical study. The 30-minute acquisition data are used to determine the optimal harmonized hyperparameters in images with minimal noise, and thus to identify the image reconstruction hyperparameters that yield the most comparable CRCs free from limitations due to statistical noise.
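As a rough check of this comparison, counts can be approximated as proportional to activity concentration times scan duration. The sketch below uses that proxy only; it ignores decay during the scan, scanner sensitivity, attenuation and scatter, and the 5000 Bq/mL clinical value is the estimate given in the preceding paragraph.

```python
def count_proxy(concentration_bq_per_ml, scan_minutes):
    """Decays per mL during the scan, ignoring decay within the scan (an assumption)."""
    return concentration_bq_per_ml * scan_minutes * 60.0

clinical = count_proxy(5000.0, 2.0)            # typical clinical bed position
phantom_range = (1600.0, 1800.0)               # nominal phantom background, Bq/mL

ratio_5min = [count_proxy(c, 5.0) / clinical for c in phantom_range]
ratio_30_vs_5 = [count_proxy(c, 30.0) / count_proxy(c, 5.0) for c in phantom_range]

print("5-minute phantom frame vs clinical bed position:", ratio_5min)
print("30-minute acquisition vs 5-minute frame:", ratio_30_vs_5)
```

Under this proxy the 5-minute frame carries roughly 80-90% of the counts of a 2-minute clinical bed position, and the 30-minute acquisition carries 6 times the counts of a single 5-minute frame.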
The average activity concentration at imaging time was approximately 8% lower in the experiments performed on the GE Signa. However, this scanner benefits from a higher sensitivity (21 vs 15 cps/kBq) relative to the Siemens mMR, and thus, when accounting for the relative scanner sensitivity, more counts were acquired on the GE Signa (~12% more). In addition, the GE Signa employs Time of Flight (TOF), while the Siemens mMR does not. The main advantages of TOF are faster convergence and a higher signal-to-noise ratio. This may explain, at least in part, why the best matching CRC curves are obtained with 2 iterations/16 subsets on the GE Signa scanner as opposed to 4 iterations/21 subsets on the Siemens mMR.
CRC mean and CRC peak appear to be more robust metrics than CRC max (and thus SUV max) as the basis for harmonization when comparing quantitative results from PET/MRI scanners, as is expected. This is likely an effect of statistical noise, which is present even in the 30-minute data sets and is greater for 5-minute acquisition times. The image noise (image roughness) depends on a variety of factors in the image reconstruction chain, including the number of iterations, the post-reconstruction filter, the choice of algorithm, the use of TOF, and especially the use of resolution recovery. As such, image noise cannot be rigorously compared. However, the reconstructed noise was determined from the image roughness in the reconstructed images, and for these matched experiments (identical fill and imaging in similar conditions), comparable signal and noise were achieved between the two cameras. In phantom studies, SUV mean is highly robust since the lesion volume is known and the activity distribution within the lesion is uniform. This is not the case in patient studies, where extreme variability is observed in the segmentation volume, making SUV mean of little clinical use currently. Lesion SUV mean is therefore not recommended within the context of clinical trial response assessment. SUV max is most typically used. Inter-reader measurement variability of SUV max is small, and it is a robust measurement, although impacted significantly by image noise. SUV peak has slowly been gaining acceptance as a more robust (less sensitive to noise) metric of response, although literature support for its use is currently less prevalent. Similarly, our data indicate that SUV peak is likely the most repeatable measure among the three and that SUV max is more affected by noise. SUV peak will generate higher SUV values than SUV mean; however, it can only be defined for lesions larger than 1 cm.
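The different noise sensitivity of the three metrics can be illustrated on a synthetic lesion. The following is a minimal sketch under stated assumptions, not the analysis used in this study: the peak value is taken as the highest mean over a sphere of about 1 cm^3 (radius 6.2 mm) centered on any lesion voxel, which is one common convention; the voxel size, lesion size, contrast and Gaussian noise level are hypothetical; and no scanner blurring is modeled.

```python
import numpy as np
from scipy import ndimage

VOXEL_MM = 2.0        # hypothetical isotropic voxel size
PEAK_RADIUS_MM = 6.2  # radius of a sphere of about 1 cm^3

def sphere_mask(shape, center, radius_mm, voxel_mm=VOXEL_MM):
    """Boolean mask of voxels whose centers lie within radius_mm of `center` (voxel indices)."""
    grids = np.indices(shape).astype(float)
    dist2 = sum(((g - c) * voxel_mm) ** 2 for g, c in zip(grids, center))
    return dist2 <= radius_mm ** 2

def lesion_metrics(image, lesion):
    """Return (mean, max, peak) of `image` over the boolean `lesion` mask."""
    half = int(PEAK_RADIUS_MM // VOXEL_MM)
    kernel = sphere_mask((2 * half + 1,) * 3, (half,) * 3, PEAK_RADIUS_MM).astype(float)
    kernel /= kernel.sum()
    # Mean over the ~1 cm^3 sphere centered on every voxel.
    local_mean = ndimage.convolve(image, kernel, mode="nearest")
    return (
        float(image[lesion].mean()),
        float(image[lesion].max()),
        float(local_mean[lesion].max()),
    )

rng = np.random.default_rng(0)
shape = (32, 32, 32)
lesion = sphere_mask(shape, (16, 16, 16), radius_mm=14.0)  # 28 mm diameter sphere
truth = np.where(lesion, 4.0, 1.0)                          # 4:1 contrast, arbitrary units

# Six independent noise realizations of the same object.
results = np.array([
    lesion_metrics(truth + rng.normal(0.0, 0.4, shape), lesion)
    for _ in range(6)
])
spread = results.std(axis=0, ddof=1)
print("average (mean, max, peak):", results.mean(axis=0))
print("spread  (mean, max, peak):", spread)
```

In such a simulation the maximum sits well above the true lesion value and varies more between realizations than the peak, consistent with the qualitative behavior described above.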
In studies where quantitative harmonization is critical to the trial's response assessment, tighter harmonization appears to be achievable when using the SUV peak metric. The SUV peak metric seems less dependent on the choice of region of interest and the level of smoothing, making it more amenable to harmonization.
A limitation of this study is that the experiments were performed in phantoms at a set count density in the spheres and background. The results cannot be directly extrapolated to human imaging, as OSEM algorithms are nonlinear and their performance depends on the specific patient activity distribution and count density. This limitation is common to all phantom studies. An alternative approach could be to insert synthetic lesions of varied activity (SUV), size and shape in clinical patient datasets. This is an active area of research that we are currently investigating.
Our data indicate that harmonized image reconstruction hyperparameters (both for the ideal long acquisition and under clinical conditions) can be achieved for PET/MR scanners, and that comparable size-dependent recovery coefficients, or size-dependent tumor SUV values, can be obtained well within the limits proposed by EARL-EANM. PET data acquired on PET/MR scanners would thus be acceptable for inclusion in multi-center clinical trials, at least as defined by the EARL-EANM criteria. However, this study goes beyond EARL-EANM, as it determines image reconstruction hyperparameters that provide practically identical CRC curves between these two scanners and thereby shows that variability in small-lesion quantitation can be largely eliminated by controlling the image reconstruction hyperparameters. This conclusion is important, as it will allow further study of other factors affecting quantitative PET in PET/MRI, such as the specific choice of attenuation correction technique (including the level at which bones are included), patient positioning aids and others.

Conclusion
Quantitative PET is influenced by a variety of technical, biological and physical factors. This work demonstrates that harmonization of PET reconstruction hyperparameters in simultaneous PET/MR is possible and can yield images with nearly identical quantitative performance in terms of CRC measurements over a range of lesion sizes. For the two commercially available PET/MRI scanners evaluated, user-selectable hyperparameters that control iterative updates, image smoothing, and PSF modeling provide a range of contrast recovery curves that allow harmonization, and a proper choice of these hyperparameters yields essentially identical CRC curves on the two scanners. This work will form the basis of further study on the quantitative performance related to the choice of attenuation correction strategy.

Declarations
Acknowledgments

Figure 2
Image roughness vs CRC peak plots for the NEMA standard set of spheres for the Siemens mMR (left) and GE Signa (right). The top row shows plots for the whole 30 minutes of listmode data and the bottom row shows the average value over six frames of 5 minutes duration. Error bars in the bottom plot show the ensemble noise over the six realizations.

CRC mean, max, peak product vs RMSD box plot. Each of the 600 pairs of image reconstruction hyperparameter combinations is represented by a box (top row). The color scale from blue to red represents the different levels of overall filtration. Dark blue boxes represent combinations with the least overall filtration (i.e. 3 mm for both mMR and Signa images) and dark red boxes correspond to the most filtration (7 mm on mMR and Signa images). Bottom row: expanded view at low RMSD to identify candidate image reconstruction parameters for optimization. The solution for best RMSD and CRC is indicated in green.

Figure 4
Matching CRC curves for mean (top), max (middle), and peak values (bottom) for 30 minutes of data corresponding to three optimization scenarios: (right) optimized for best resolution or highest CRC values, (middle) optimized for both resolution and RMSD, and (left) optimized for best RMSD. The corresponding RMSD values are indicated in each panel. Results from the Siemens Biograph mMR are in red, and those from the GE Signa are in blue. The dashed lines represent the EANM limits on CRC max.

Matching CRC curves for mean (top), max (middle), and peak values (bottom) for 5 minutes of data corresponding to three optimization scenarios: (right) optimized for best resolution or highest CRC values, (middle) optimized for both resolution and RMSD, and (left) optimized for best RMSD. The corresponding RMSD values are indicated in each panel. Results from the Siemens Biograph mMR are in red, and those from the GE Signa are in blue. The dashed lines represent the EANM limits on CRC max.
