In patient-based classification, the neural network correctly predicted both the malignant and benign categories with 99.4% accuracy, whereas the accuracy for equivocal patients was 87.5%. The resulting average accuracy of 95.4% over the three categories suggests that a CNN may be useful for 3-category classification from MIP images of FDG PET. Furthermore, the region of malignant uptake was correctly identified with accuracies of 97.3% (head and neck), 96.6% (chest), 92.8% (abdomen), and 99.6% (pelvic region). These results suggest that the system may have the potential to help radiologists avoid oversight and misdiagnosis.
To clarify the reasons for classification failure, we investigated cases that were incorrectly predicted in Experiment 1. As expected, the most frequent patterns were strong physiological uptake and weak pathological uptake. In the case shown in Fig. 3a, the physiological accumulation in the oral region was relatively high, which might have caused the erroneous prediction. In contrast, another case (Fig. 3b) showed many small lesions with low-to-moderate accumulation and was erroneously predicted as benign despite the true label being malignant. The equivocal category was more difficult for the neural network to predict, and its accuracy was lower than that of the other categories. This may be due to the definition of the category: though common in clinical settings, “equivocal” is a catch-all category for all images not clearly belonging to “malignant” or “benign”, so a greater variety of images was included in it. We speculate that such a wide range of appearances made it difficult for the neural network to extract consistent features.
We also conducted patient-based prediction in this study. Its accuracy was higher than that of image-based prediction owing to an ensemble effect: the approach takes advantage of the MIP images generated from various angles by aggregating the predictions across them for each patient.
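For illustration, a minimal sketch of one such aggregation is given below, assuming the per-angle class probabilities are simply averaged; the function name and the averaging rule are illustrative assumptions, and majority voting over the angles would be a similar alternative:

```python
import torch

def patient_prediction(model, mip_images):
    """Combine a patient's rotated MIP views into one prediction.

    mip_images: tensor of shape (n_angles, C, H, W), one MIP per rotation angle.
    Returns the predicted class index (e.g., 0=benign, 1=equivocal, 2=malignant).
    """
    model.eval()
    with torch.no_grad():
        logits = model(mip_images)            # (n_angles, n_classes)
        probs = torch.softmax(logits, dim=1)  # per-angle class probabilities
        mean_probs = probs.mean(dim=0)        # ensemble: average over angles
    return int(mean_probs.argmax())
```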
The CNN focuses on some features of the images. Grad-CAM is a technology that visualizes the region of interest. The results of Experiment 3 suggested that CNN responded to the part of the malignant uptake if presented. Grad-CAM results would provide physicians information on the mechanisms of the CNN; such information would help physicians decide whether to accept or reject the CNN’s diagnosis.
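For reference, a self-contained sketch of the standard Grad-CAM computation is shown below: the gradients of the target class score are global-average-pooled into channel weights for the feature maps of a chosen convolutional layer. The function and argument names are illustrative, not those of the study's implementation:

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, class_idx=None):
    """Compute a Grad-CAM heat map for one image of shape (1, C, H, W)."""
    activations, gradients = [], []
    h1 = target_layer.register_forward_hook(
        lambda m, inp, out: activations.append(out))
    h2 = target_layer.register_full_backward_hook(
        lambda m, gin, gout: gradients.append(gout[0]))
    try:
        logits = model(image)
        if class_idx is None:
            class_idx = logits.argmax(dim=1).item()
        model.zero_grad()
        logits[0, class_idx].backward()          # gradients of the class score
        acts, grads = activations[0], gradients[0]
        weights = grads.mean(dim=(2, 3), keepdim=True)  # pooled channel weights
        cam = F.relu((weights * acts).sum(dim=1))       # weighted feature maps
        cam = cam / (cam.max() + 1e-8)                  # normalize to [0, 1]
        return F.interpolate(cam.unsqueeze(1), size=image.shape[2:],
                             mode="bilinear", align_corners=False)[0, 0]
    finally:
        h1.remove()
        h2.remove()
```

For a ResNet-style backbone, `target_layer` would typically be the final convolutional block (e.g., `model.layer4[-1]` in torchvision's ResNet).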
The computational cost becomes enormous when a CNN learns directly from 3D images [22–26]. Although we employed MIP images in the current study, an alternative approach would be to feed each slice to the CNN. However, even in a ‘malignant’ or ‘equivocal’ case, the tumor is usually localized to a small area, so most slices contain no abnormal findings. Consequently, the resulting positive-versus-negative class imbalance would hinder efficient learning. In this context, MIP seems advantageous for a CNN, as the MIP image of a malignant patient usually contains the abnormal accumulation somewhere, unless stronger physiological accumulation (e.g., in the brain or bladder) hides the malignant uptake.
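To make the projection step concrete, a minimal sketch of generating rotated MIP images from a 3D PET volume is given below (NumPy/SciPy); the number of angles and the interpolation settings are illustrative assumptions, not the study's actual preprocessing:

```python
import numpy as np
from scipy.ndimage import rotate

def make_mip_images(volume, n_angles=8):
    """Generate rotated MIP images from a 3D PET volume.

    volume: array of shape (z, y, x) holding SUV or activity values.
    Returns a list of n_angles 2D maximum-intensity projections, one per
    rotation about the cranio-caudal (z) axis.
    """
    mips = []
    for k in range(n_angles):
        # 180 degrees suffice: opposite viewing directions give mirrored MIPs.
        angle = 180.0 * k / n_angles
        rotated = rotate(volume, angle, axes=(1, 2), reshape=False, order=1)
        mips.append(rotated.max(axis=2))  # project along x after rotation
    return mips
```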
We believe that this system will be useful in various clinical situations. First, as an automated double-check system, it can reduce oversight and misdiagnosis by physicians. Second, it can assist less experienced physicians, especially residents, in completing radiology reports. Third, it can be used as a triage system to determine priority cases for a radiologist's review: the radiologist would read the images the CNN classifies as malignant before reading those of benign-classified patients. This could be highly useful when urgent care is needed.
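As an illustration of the triage workflow, the reading order could simply follow the CNN's predicted malignancy probability; the helper below is hypothetical, and `studies` and `model_scores` are assumed inputs rather than components of the described system:

```python
def triage_order(studies, model_scores):
    """Sort a reading worklist so CNN-flagged malignant studies come first.

    studies: list of study identifiers.
    model_scores: dict mapping study id -> predicted malignancy probability.
    """
    return sorted(studies, key=lambda s: model_scores[s], reverse=True)
```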
In this study, we used only two scanners; further studies are needed to reveal what happens when more scanners are included. For instance, what if the number of examinations acquired on each scanner is imbalanced? What if a particular disease is imaged by some scanners but not by others? The AI system may fail to make a correct evaluation in such cases, and it should therefore be tested on “real-world data” before being used in clinical settings.
Several approaches could further improve the accuracy. In this work, to reduce the training cost, we used a network equivalent to ResNet-34 [27], a relatively simple member of the ResNet family. Deeper ResNet variants can readily be built, and more recently, various networks based on ResNet have been developed and shown to perform well [28, 29]. From the viewpoint of big-data science, increasing the number of images is also important for further improvement in diagnostic accuracy.
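As a point of reference, a network equivalent to this backbone can be instantiated in a few lines of PyTorch/torchvision; this is a sketch only, and the study's exact architecture, input handling, and training setup are not reproduced here:

```python
import torch.nn as nn
from torchvision import models

def build_classifier(num_classes=3, pretrained=False):
    """ResNet-34 backbone with a 3-way head (benign / equivocal / malignant)."""
    net = models.resnet34(weights=models.ResNet34_Weights.DEFAULT
                          if pretrained else None)
    net.fc = nn.Linear(net.fc.in_features, num_classes)  # replace 1000-class head
    return net
```

Swapping in a deeper variant (e.g., `models.resnet50`) changes only the constructor call, which is one reason the ResNet family is convenient for such comparisons.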
This study has some limitations. First, the model can only deal with FDG PET MIP images covering the head to the knees; correct prediction is much more difficult for spot images or whole-body images from the head to the toes. Future studies will apply a region-based CNN (R-CNN) to address this problem. Second, lesions with low accumulation, such as pancreatic cancer, may not be distinguishable on MIP images alone and therefore may not be labeled correctly. Third, the cases were classified by a nuclear medicine physician and were not based on pathological diagnosis.