In this study we designed a CNN second reader for automatic classification of DaTSCAN images. The DaTNet-3_PPMI model was trained with multi-site study data from the PPMI data set, and the DaTNet-3_STA model with hospital-specific data obtained in routine practice. The performance of the DaTNet-3_PPMI model was assessed on the PPMI test set to benchmark it against other DaTSCAN classification studies. As shown in Table 1, DaTNet-3_PPMI classified PPMI images accurately, with results similar to those of the studies by Mohammed et al [4] and Ding et al [6]. DaTNet-3 outperformed the model by Prashanth and co-workers [7], which used striatal binding ratio (SBR) values rather than the images themselves. Such imaging features can be affected by changes in reconstruction and normalization steps [20, 21], making them less robust than our image-based DaTNet-3 CNN method. It is worth noting that some studies used 10-fold cross-validation [4, 7] and another study was hampered by a small test sample size [5], which may have led to an overestimation of their performance. This strengthens our conclusion that DaTNet-3 performs at least on par with, or better than, these models. Moreover, since our model was tested on a larger test set and without 10-fold cross-validation, we add to the evidence that AI is potentially feasible for classifying DaTSCAN images.
To study whether an AI model can be successfully trained with multi-site study data, we assessed the performance of DaTNet-3_PPMI and DaTNet-3_STA on STA images obtained in routine practice. The results in Table 2 show that DaTNet-3 trained with PPMI images performs slightly less accurately (84.5%) than DaTNet-3 trained with clinical STA images (89.0%). However, the DaTNet-3_PPMI model has a very high specificity (98%), which implies that STA scans with truly normal DAT binding, which consequently do not support a clinical diagnosis of PD, are correctly classified as such in 98% of cases.
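It is worth keeping specificity distinct from the probability that a predicted-normal scan is truly normal (the negative predictive value), since the two are easily conflated. A minimal sketch of the two quantities; the function names and confusion-matrix counts below are purely illustrative and do not come from this study:

```python
def specificity(tn: int, fp: int) -> float:
    """Fraction of truly normal (non-deficit) scans that the model
    classifies as normal: TN / (TN + FP)."""
    return tn / (tn + fp)

def npv(tn: int, fn: int) -> float:
    """Negative predictive value: fraction of scans predicted normal
    that are truly normal: TN / (TN + FN)."""
    return tn / (tn + fn)

# Hypothetical counts, for illustration only:
tn, fp, fn = 98, 2, 8
print(f"specificity = {specificity(tn, fp):.2f}")  # 0.98
print(f"NPV         = {npv(tn, fn):.3f}")
```

Note that the NPV additionally depends on the prevalence of deficit scans in the tested population, whereas specificity does not.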
We further studied the effect on the performance of both models of classifying scans as uncertain by thresholding the output probability. The clinically trained model, DaTNet-3_STA, showed a large increase in accuracy (from 89.0% to 95.7%), indicating that a substantial fraction of its incorrect classifications fell within the 0.2–0.8 uncertainty range. These results underline the importance of training with data that contain indeterminate scans, as this allowed the model to better represent such images.
The lack of an accuracy increase for DaTNet-3_PPMI (from 84.5% to 85.8%) suggests that the model trained with PPMI images is incapable of assigning uncertain probabilities to the indeterminate STA images. Instead, it misclassifies them confidently (i.e., with an output probability below 0.2 or above 0.8) as NC or DD. This difference is probably explained by the fact that the PPMI data set contains only manifest PD and normal images, in contrast to the STA data set, which also contains other parkinsonian syndromes and indeterminate images. We therefore conclude that the usability of a multi-site study data set for training can improve if indeterminate images are included as well, as is common practice in routine clinical studies.
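The uncertainty thresholding described above can be sketched as follows. The 0.2/0.8 cut-offs are those used in this study; the function name and label strings are ours, chosen for illustration:

```python
def classify_with_uncertainty(prob_dd: float,
                              lower: float = 0.2,
                              upper: float = 0.8) -> str:
    """Map a model output probability of a dopaminergic deficit (DD)
    to one of three labels. Scans whose probability falls between
    `lower` and `upper` are flagged as uncertain and deferred to the
    reading physician rather than classified automatically."""
    if prob_dd < lower:
        return "NC"         # confidently normal (non-deficit)
    if prob_dd > upper:
        return "DD"         # confidently dopaminergic deficit
    return "uncertain"      # deferred for human review
```

Accuracy is then computed only over the confidently classified scans, which is why a model whose errors cluster in the 0.2–0.8 band (DaTNet-3_STA) benefits from this scheme, while a model that errs confidently (DaTNet-3_PPMI) does not.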
The performance of DaTNet-3 can be compared to the published inter-observer agreement of the DaTSCAN studies by Tondeur et al [2] and Booij et al [3] (37–100% agreement and κ-coefficients of 0.74–0.93 for reader pairs, respectively). This comparison may indicate that DaTNet-3 agrees closely with the interpretations made by nuclear medicine physicians, and DaTNet-3 is therefore potentially valuable as a second reader for DaTSCAN images in routine practice. In addition, both models have a very high specificity for classifying images as normal, so a potential application of such an AI tool could be to identify normal, non-deficit images with high confidence.
Despite its high accuracy, DaTNet-3 does not take patient age into account, although age is useful for differentiating age-related from parkinsonian-related dopaminergic depletion [22]. Future studies could include patient age as an additional model input to further improve accuracy.
Furthermore, it should be noted that all our training and testing data originate from general-purpose SPECT cameras. Images from dedicated brain scanners, such as the InSPira HD SPECT system [23], might lead to diagnoses based on detailed local diminished uptake that is visible with such high-resolution systems. Preliminary results (data not shown) of our model on a test set of DaTSCAN images acquired on the InSPira HD SPECT system indeed showed that the model did not generalize well to data from a dedicated brain scanner. In future studies, we therefore propose to set up a multi-site study data set with data from both general-purpose and brain-dedicated systems to evaluate whether the model can also perform accurately on data obtained on brain-dedicated systems.
Currently, one of the largest hurdles in AI is retrieving usable historic data from usually unstructured sources [24], and much time was spent creating processed and correctly labeled training data. In this study, the lack of detailed and objective DaTSCAN interpretation reports made it difficult to classify images precisely. In the future, the use of a predefined 5-stage degeneration scale [25] stored in a structured format could be considered to improve model input.