In this study, we were the first to train, cross-validate, and independently validate an automatic image quality control method for PET images. As such, the trained CNNs could be used in clinical multi-center studies to determine whether images from different institutions comply with the EARL standards and can be used together. To make the methods and results publicly available, a Python script that takes one image or a series of NIfTI images as input and displays the corresponding reconstruction setting has been published on Zenodo and can be used by the community.
Quality control for PET images is an important task. With the increasing use of machine and deep learning in nuclear medicine, where the amount of training data is limited, it is especially important to use images yielding similar image quality. To date, image quality in PET images can be assessed by manually drawing a sphere in the patient's liver and comparing the mean liver intensity value across patients. However, this approach is time consuming. Moreover, the mean liver value mainly provides information about the correctness of the SUV and does not necessarily indicate whether the reconstruction settings used are comparable [12].
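For illustration, a minimal sketch of such a spherical liver-VOI measurement is given below; it assumes the image has already been converted to SUV, and the sphere center, radius, and voxel spacing are placeholder values chosen for the example:

```python
# Minimal sketch of the manual liver check: mean SUV inside a sphere.
# Center, radius, and voxel spacing are illustrative; in practice the
# sphere is placed manually in the liver.
import numpy as np

def mean_suv_in_sphere(suv_volume, center, radius_mm, spacing_mm):
    """Mean SUV inside a sphere of radius_mm around a center voxel."""
    zz, yy, xx = np.indices(suv_volume.shape)
    dist = np.sqrt(((zz - center[0]) * spacing_mm[0]) ** 2 +
                   ((yy - center[1]) * spacing_mm[1]) ** 2 +
                   ((xx - center[2]) * spacing_mm[2]) ** 2)
    return suv_volume[dist <= radius_mm].mean()

# Example with a synthetic volume (4 mm isotropic voxels, 15 mm sphere):
suv = np.random.normal(2.0, 0.2, size=(100, 100, 100))
print(mean_suv_in_sphere(suv, center=(50, 50, 50), radius_mm=15, spacing_mm=(4, 4, 4)))
```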
While details about the reconstruction settings used are listed in the DICOM header, only very experienced users can identify whether a reconstruction setting is compliant with the EARL standards. Additionally, due to anonymization or conversion to other image formats such as NIfTI, this information is often incomplete.
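As an illustration, reconstruction-related attributes can be inspected with pydicom; the file name below is a placeholder, and which attributes survive anonymization or conversion varies between sites and vendors:

```python
# Sketch of inspecting reconstruction-related DICOM attributes.
# Attributes may be absent after anonymization or format conversion.
import pydicom

ds = pydicom.dcmread("pet_slice.dcm")  # placeholder file name
for keyword in ("Manufacturer", "ManufacturerModelName", "ReconstructionMethod"):
    print(keyword, ":", ds.get(keyword, "<missing>"))
```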
In this study, two-dimensional image slices were used as input to the CNN rather than, as for other tasks, the 3D information of the whole image. By using 2D slices, the amount of training data was enlarged, thereby possibly increasing the classification performance. Moreover, in contrast to segmentation or other tasks, the 3D information of the image is not required, as only the noise structure and resolution of the image matter for the classification.
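A minimal sketch of this slicing step is shown below, assuming a NIfTI volume whose last axis is the axial direction; the file name and array layout are illustrative:

```python
# Sketch: turning a 3D NIfTI PET volume into 2D axial slices for training.
# The last axis of the array is assumed to be the slice direction.
import nibabel as nib
import numpy as np

volume = nib.load("patient001_pet.nii.gz").get_fdata()      # placeholder file name
slices = [volume[:, :, k] for k in range(volume.shape[2])]  # one 2D sample per slice
X = np.stack(slices)[..., np.newaxis]                       # (n_slices, H, W, 1) for a 2D CNN
print(X.shape)
```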
Several strategies were used in this study to guarantee the generalizability of the CNN and to avoid overfitting. As PET images come with a low spatial resolution and fewer image details than, e.g., CT images, a lower number of features and CNN layers might be an advantage [7]. Building generalizable CNNs is an important challenge, as CNNs normally contain a large number of learnable features while the amount of training data is often limited. Therefore, the CNN used in this study is very sparse, containing only a small number of layers and learnable features. A large dropout percentage was chosen in order to avoid overfitting. The comparable performance in training, cross-validation, and independent external testing underlines the generalizability of the proposed CNNs. The small number of layers and features seems appropriate because only few details are needed to identify the image resolution (i.e., EARL1, EARL2, or clinical image quality). In our experience, when using more features and layers, the CNNs learned scanner-specific details and were not able to generalize to unseen data.
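The following Keras sketch illustrates such a deliberately sparse 2D CNN with a large dropout rate; the filter counts, kernel sizes, input size, and dropout rate are illustrative assumptions, not the exact architecture used in this study:

```python
# Illustrative sketch of a small 2D CNN with heavy dropout, in the spirit
# of the sparse network described above. All layer sizes are assumptions.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(128, 128, 1)),        # one 2D PET slice
    tf.keras.layers.Conv2D(8, 3, activation="relu"),   # deliberately few filters
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dropout(0.5),                      # large dropout against overfitting
    tf.keras.layers.Dense(3, activation="softmax"),    # EARL1 / EARL2 / clinical
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()
```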
Some EARL2-compliant images were incorrectly identified as being EARL1 compliant. This was especially the case for patients with a low tumor load. In these cases, the edge images contained only a few edges, which might be the reason for the misclassification. However, as the CNN made the correct decision in the majority of cases, it may still be possible to assess whether an imaging site complies with an EARL standard by looking at the overall classification across all images from that site.
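Such a site-level assessment could, for example, be implemented as a simple majority vote over the per-image predictions; the class encoding and the prediction values below are assumptions for illustration:

```python
# Sketch of site-level assessment: majority vote over per-image predictions.
import numpy as np

predictions = np.array([1, 1, 2, 1, 1, 0, 1])   # e.g. 0=clinical, 1=EARL1, 2=EARL2
site_label = np.bincount(predictions).argmax()  # most frequent predicted class
print("Site-level classification:", site_label)
```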
The present study, as well as the EARL accreditation, focuses on oncological PET images. However, harmonized image reconstructions are also essential for neurological images in order to compare images across institutions [19]. Therefore, future studies could focus on the use of CNNs to determine image quality for neurological PET images.
One limitation of the present study is that we used only one external validation dataset with images acquired on a different PET/CT scanner. It might be that the CNN performance decreases for images from other scanners. However, the accuracy in the independent dataset indicates that the trained CNNs can generalize well and can be used with data from other scanners. Moreover, we made the trained networks as well as the code used in this paper publicly available. Therefore, users can retrain the networks with data from other scanners such that the generalizability of the proposed method can be further improved.
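As a sketch of such a retraining step, assuming the released model is stored as a Keras model file (the file name and the training arrays below are placeholders, not part of the released code):

```python
# Sketch of fine-tuning the released network on data from another scanner.
# The weights file name and the training arrays are placeholders.
import tensorflow as tf
import numpy as np

model = tf.keras.models.load_model("trained_qc_cnn.h5")  # hypothetical model file
X_new = np.random.rand(32, 128, 128, 1)                  # stand-in for new-scanner slices
y_new = np.random.randint(0, 3, size=32)                 # stand-in class labels
model.fit(X_new, y_new, epochs=5, batch_size=8)          # brief retraining on the new data
```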