In this study, we were the first to train, cross-validate, and independently validate an automatic image quality control method for PET images. As such, the trained CNNs could be used in clinical multi-center studies to determine whether images from different institutions comply with the EARL standards and can be used together. To make the methods and results publicly available, a Python script that takes one image or a series of NIfTI images as input and displays the corresponding reconstruction setting has been published on Zenodo and can be used by the community.
Quality control for PET images is an important task. With the increasing use of machine and deep learning in nuclear medicine, where the amount of training data is limited, it is especially important to use images yielding similar image quality. To date, image quality in PET images can be assessed by manually drawing a sphere in the patient's liver and comparing the mean liver intensity value across patients. However, this approach is time consuming. Moreover, the mean liver value mainly provides information about the correctness of the SUV and does not necessarily indicate whether the reconstruction settings used are comparable [12].
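For illustration, a minimal sketch of such a spherical liver-VOI measurement is given below; it assumes the image has already been converted to SUV, and the sphere center, radius, and voxel spacing are placeholder values chosen for the example:

```python
# Minimal sketch of the manual liver check: mean SUV inside a sphere.
# Center, radius, and voxel spacing are illustrative; in practice the
# sphere is placed manually in the liver.
import numpy as np

def mean_suv_in_sphere(suv_volume, center, radius_mm, spacing_mm):
    """Mean SUV inside a sphere of radius_mm around a center voxel."""
    zz, yy, xx = np.indices(suv_volume.shape)
    dist = np.sqrt(((zz - center[0]) * spacing_mm[0]) ** 2 +
                   ((yy - center[1]) * spacing_mm[1]) ** 2 +
                   ((xx - center[2]) * spacing_mm[2]) ** 2)
    return suv_volume[dist <= radius_mm].mean()

# Example with a synthetic volume (4 mm isotropic voxels, 15 mm sphere):
suv = np.random.normal(2.0, 0.2, size=(100, 100, 100))
print(mean_suv_in_sphere(suv, center=(50, 50, 50), radius_mm=15, spacing_mm=(4, 4, 4)))
```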
While details about the reconstruction settings used are listed in the DICOM header, only very experienced users can identify whether a reconstruction setting is compliant with the EARL standards. Additionally, due to anonymization or conversion to other image formats such as NIfTI, this information is often incomplete.
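As an illustration, reconstruction-related attributes can be inspected with pydicom; the file name below is a placeholder, and which attributes survive anonymization or conversion varies between sites and vendors:

```python
# Sketch of inspecting reconstruction-related DICOM attributes.
# Attributes may be absent after anonymization or format conversion.
import pydicom

ds = pydicom.dcmread("pet_slice.dcm")  # placeholder file name
for keyword in ("Manufacturer", "ManufacturerModelName", "ReconstructionMethod"):
    print(keyword, ":", ds.get(keyword, "<missing>"))
```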
In this study, two-dimensional image slices were used as input to the CNN rather than, as for other tasks, the 3D information of the whole image. By using 2D slices, the amount of training data was enlarged, thereby possibly increasing the classification performance. Moreover, in contrast to segmentation or other tasks, the 3D information of the image is not required, as only the noise structure and resolution of the image matter for the classification.
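A minimal sketch of this slicing step is shown below, assuming a NIfTI volume whose last axis is the axial direction; the file name and array layout are illustrative:

```python
# Sketch: turning a 3D NIfTI PET volume into 2D axial slices for training.
# The last axis of the array is assumed to be the slice direction.
import nibabel as nib
import numpy as np

volume = nib.load("patient001_pet.nii.gz").get_fdata()      # placeholder file name
slices = [volume[:, :, k] for k in range(volume.shape[2])]  # one 2D sample per slice
X = np.stack(slices)[..., np.newaxis]                       # (n_slices, H, W, 1) for a 2D CNN
print(X.shape)
```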
Several strategies were used in this study to guarantee the generalizability of the CNN and to avoid overfitting. As PET images come with a low spatial resolution and fewer image details than, e.g., CT images, a lower number of features and CNN layers might be an advantage [7]. Building generalizable CNNs is an important challenge, as CNNs normally contain a large number of learnable features while the amount of training data is often limited. Therefore, the CNN used in this study is very sparse, containing only a small number of layers and learnable features. A large dropout percentage was chosen in order to avoid overfitting. The comparable performance in training, cross-validation, and independent external testing underlines the generalizability of the proposed CNNs. The small number of layers and features seems appropriate because only few details are needed to identify the image resolution (i.e., EARL1, EARL2, or clinical image quality). In our experience, when using more features and layers, the CNNs learned scanner-specific details and were not able to generalize to unseen data.
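The following Keras sketch illustrates such a deliberately sparse 2D CNN with a large dropout rate; the filter counts, kernel sizes, input size, and dropout rate are illustrative assumptions, not the exact architecture used in this study:

```python
# Illustrative sketch of a small 2D CNN with heavy dropout, in the spirit
# of the sparse network described above. All layer sizes are assumptions.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(128, 128, 1)),        # one 2D PET slice
    tf.keras.layers.Conv2D(8, 3, activation="relu"),   # deliberately few filters
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dropout(0.5),                      # large dropout against overfitting
    tf.keras.layers.Dense(3, activation="softmax"),    # EARL1 / EARL2 / clinical
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()
```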
Some EARL2-compliant images were incorrectly identified as being EARL1 compliant. This was especially the case for patients with a low tumor load. In these cases, the edge images contained only a few edges, which might be the reason for the misclassification. However, as the CNN made the correct decision in the majority of cases, it may still be possible to assess whether an imaging site complies with an EARL standard by looking at the overall classification across all images from that site.
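Such a site-level assessment could, for example, be implemented as a simple majority vote over the per-image predictions; the class encoding and the prediction values below are assumptions for illustration:

```python
# Sketch of site-level assessment: majority vote over per-image predictions.
import numpy as np

predictions = np.array([1, 1, 2, 1, 1, 0, 1])   # e.g. 0=clinical, 1=EARL1, 2=EARL2
site_label = np.bincount(predictions).argmax()  # most frequent predicted class
print("Site-level classification:", site_label)
```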
The present study, as well as the EARL accreditation, focuses on oncological PET images. However, harmonized image reconstructions are also essential for neurological images in order to compare images across institutions [19]. Therefore, future studies could focus on the use of CNNs to determine image quality for neurological PET images.
One limitation of the present study is that we used only one external validation dataset with images acquired on a different PET/CT scanner. It might be that the CNN performance decreases for images from other scanners. However, the accuracy in the independent dataset indicates that the trained CNNs can generalize well and can be used with data from other scanners. Moreover, we made the trained networks as well as the code used in this paper publicly available. Therefore, users can retrain the networks with data from other scanners such that the generalizability of the proposed method can be further improved.
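As a sketch of such a retraining step, assuming the released model is stored as a Keras model file (the file name and the training arrays below are placeholders, not part of the released code):

```python
# Sketch of fine-tuning the released network on data from another scanner.
# The weights file name and the training arrays are placeholders.
import tensorflow as tf
import numpy as np

model = tf.keras.models.load_model("trained_qc_cnn.h5")  # hypothetical model file
X_new = np.random.rand(32, 128, 128, 1)                  # stand-in for new-scanner slices
y_new = np.random.randint(0, 3, size=32)                 # stand-in class labels
model.fit(X_new, y_new, epochs=5, batch_size=8)          # brief retraining on the new data
```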