Artificial Intelligence-Based Assistance in Clinical 123I-FP-CIT SPECT Scan Interpretation

Purpose: Dopamine transporter (DAT) imaging with 123I-FP-CIT SPECT is used to support the diagnosis of Parkinson’s disease (PD) in clinically uncertain cases. Previous studies showed that automatic classification of 123I-FP-CIT SPECT images (marketed as DaTSCAN) is feasible using machine learning algorithms. However, these studies lacked sizable use of data from routine clinical practice. This study aims to contribute to the discussion of whether artificial intelligence (AI) can be applied in clinical practice. Moreover, we investigated the need for hospital-specific training data.
Methods: A convolutional neural network (CNN) named DaTNet-3 was designed and trained to classify DaTSCAN images as either normal or supportive of a dopaminergic deficit. Both a multi-site data set (n = 2412) from the Parkinson’s Progression Markers Initiative (PPMI) and an in-house data set containing clinical images (n = 932) obtained in routine practice at the St Antonius hospital (STA) were used for training and testing. STA images were labeled based on interpretation by nuclear medicine physicians. To investigate how indeterminate scans affect classification accuracy, a threshold was applied to the output probability.
Results: DaTNet-3 trained with STA data reached an accuracy of 89.0% in correctly identifying images of the clinical STA test set as either normal or with decreased striatal DAT binding (DaTNet-3 trained with PPMI data reached 98.5% on the PPMI test set). When thresholded, accuracy increased to 95.7%. This increase was not observed when trained with PPMI data, indicating that the misclassified images were confidently assigned to the incorrect class.
Conclusion: Based on the results of DaTNet-3 we conclude that automatic interpretation of DaTSCAN images with AI is feasible and robust. Further, we conclude that DaTNet-3 performs slightly better when it is trained with hospital-specific data. This difference increased when the output probability was thresholded. Therefore we conclude that the usability of a data set increases if it contains indeterminate images.


Introduction
Parkinson's Disease (PD) is the fastest growing neurological condition in the world, with prevalence rates increasing by about 74% from 1990 to 2016 [1]. PD is known for its distinct pathological changes, such as the degeneration of dopaminergic nigrostriatal neurons, projecting from the substantia nigra to the striatum of the brain. Early differentiation between patients with degeneration of dopaminergic neurons and those without degeneration is important for prognosis and treatment management. Dopamine transporter (DAT) imaging, using 123I-FP-CIT single photon emission computed tomography (SPECT) (marketed as DaTSCAN), is currently the standard neuroimaging technique to support or exclude the diagnosis of dopaminergic deficit, consistent with PD and atypical parkinsonism, in clinically unclear cases. Inter-observer variability among human readers [2,3] and the dependence on experienced nuclear medicine physicians make DaTSCAN interpretation an interesting task for artificial intelligence (AI) assisted classification. Several machine learning algorithms for this task have been reported in the literature and show good performance [4][5][6][7]. However, there is a lack of studies evaluating the performance of training and testing on sizable clinical data sets. Therefore, in this study we investigate the need for data from clinical practice and compare it against a study data set aggregated from varying sources.
We therefore designed a convolutional neural network (CNN) model for DaTSCAN interpretation. The model was trained with a publicly available data set and an in-house set of DaTSCAN images obtained in routine practice. We investigated whether it is feasible to reliably classify DaTSCAN images using a CNN. Furthermore, we studied whether the model needs to be trained with camera- and department-specific data or whether a multi-site study data set can be used as a training set.
Additionally, the relation between the output probability and indeterminate images was studied.

Materials & Methods
DaTSCAN is a well-validated imaging tool used to investigate the loss of nigrostriatal dopaminergic neurons, by assessing DAT binding in the striatum. After injection of the radiotracer, SPECT imaging is performed, typically 3 to 4 h after injection [8], to create a 3D sliced representation of the striatal DAT binding.

Data
Parkinson's Progression Markers Initiative (PPMI)
A set of images was retrieved from the Parkinson's Progression Markers Initiative (PPMI) database [9]. This data set has been used by many earlier DaTSCAN classifying studies [4][5][6][7] and is useful as a multi-site study data set from varying sources and for benchmarking. PPMI is a longitudinal study designed to assess the progression of PD using clinical features, biological markers and imaging data [9]. Acquisition protocols for the DaTSCAN imaging varied between originating centers. Yet, all centers used a 128 x 128 matrix, between 90 and 120 projections and an energy window centered on 159 keV +/- 10%. Images were reconstructed using filtered back-projection or iterative reconstruction and were spatially normalized by registration to Montreal Neurological Institute (MNI) space using PMOD (PMOD Technologies, Zurich, Switzerland) [9].
Images from the PPMI data set (dimensions: 91 x 109 x 91, voxel size 2 x 2 x 2 mm³) were further processed by extracting the binding region of the DAT-rich striatum, which was assessed to be positioned in the same 20 slices in MNI space. Finally, images were downscaled to dimensions of 17 x 23 x 20.
This data set is referred to as the PPMI data set. The data set contains 351 normal control (NC) images originating from healthy controls and scans obtained in patients without evidence for dopaminergic deficit (SWEDD cohort) and 1422 dopaminergic deficit (DD) images originating from early PD patients.
Images obtained in routine practice
A total of 671 DaTSCAN images used in the differential diagnosis of clinically unclear patients were included retrospectively (December 2011 to February 2021). These images were acquired at the St. Antonius Hospital, Nieuwegein. Data were acquired with a double-head SPECT system (Siemens Symbia T2) with low-energy, high-resolution collimators. Scans were made 4 h after intravenous injection of ~185 MBq 123I-FP-CIT, according to common guidelines [8,10]. A total of 120 projections were acquired at 60 s per view for patients (128 x 128 matrix, zoom = 1). All SPECT images were reconstructed using 3D ordered-subsets expectation-maximization (3D OSEM), using 4 iterations, 8 subsets and scatter correction. Reconstructions were filtered with an 8.4 mm Gaussian filter and CT-based attenuation correction was performed.
All images were spatially normalized by registration to MNI space using SimpleITK [11]. The same processing steps were performed as used on the PPMI data set, resulting in final preprocessed images with the same dimensions and voxel size.
This data set is referred to as the St Antonius (STA) data set. Categorization of the clinical images into normal control (NC) (n = 377) and dopaminergic deficit (DD) (n = 294) was based on the result of the original report by nuclear medicine physicians, using both visual and quantitative assessment. As this data set contains scans of clinically unclear patients and classifications are based on the interpretation reported by a single nuclear medicine physician, it could contain incorrect classifications of indeterminate scans; such scans are referred to as indeterminates. The study protocol was examined by the Medical Research Ethics Committees United (institutional review board) of the St Antonius hospital, which determined that, due to the nature of the research and since all patient data were fully anonymized, informed consent of the participants was waived. This study was conducted according to the Declaration of Helsinki.

Convolutional Neural Network Architecture
A CNN model, named DaTNet-3, was designed (summarized in Fig. 1), partly derived from the network by Mohammed et al. [4], which in itself is a modified version of AlexNet [12].
The DaTNet-3 architecture consists of three 3D convolutional layers, each followed by a max pooling layer (kernel size: 3 x 3 x 3, 128 feature maps). Feature maps are reduced by max pooling layers (taking the highest value of each feature map filter patch) for robustness [13]. In contrast to previous DaTSCAN classifying networks [4,5], which normalize data in preprocessing, DaTNet-3 uses three batch normalization layers. This allows distinguishing image features to stand out and lessens the effects of differing imaging sources [14]. Rectified linear unit (ReLU) activation layers were added for each of the three layers, providing sensitive neuron activation and a lower computational cost while avoiding easy saturation to a particular class [12]. Moreover, dropout was implemented in each of the three layers. Using dropout, noise is introduced to a part of the feature map inputs with a chance of 10% for additional regularization [15]. The final layer contains a global average pooling layer which downsamples each feature map to a single average value. This enforces a relation between previously generated feature maps and the output, allowing the interpretation of confidence for classifications [16]. The globally averaged values are fed into the output layer, which uses a sigmoid function, resulting in an output between 0 and 1. Values under 0.5 are interpreted as NC while values above are considered images with DD. The closer the output is to either 0 or 1, the more likely the image belongs to the particular class.
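The output interpretation described above can be illustrated with a minimal sketch. It is not the DaTNet-3 implementation: the function names are hypothetical, the input scores are made up, and the handling of an output of exactly 0.5 (not specified in the text) is an assumption.

```python
import math


def sigmoid(score: float) -> float:
    """Logistic function mapping a real-valued score to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))


def interpret(probability: float, cutoff: float = 0.5) -> str:
    """Map an output probability to the NC/DD reading described in the text.

    Values under the cutoff are read as normal control (NC), values above as
    dopaminergic deficit (DD). Assigning exactly 0.5 to DD is an assumption.
    """
    return "DD" if probability >= cutoff else "NC"


# Illustrative scores only, not model outputs from the study.
for score in (-3.0, -0.2, 2.5):
    p = sigmoid(score)
    print(f"score={score:+.1f} probability={p:.3f} label={interpret(p)}")
```

Outputs close to 0 or 1 correspond to scores far from zero, which is why the distance of the output from 0.5 can be read as a rough indication of classification confidence.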

DaTNet-3 training and setup
DaTNet-3 was constructed and trained using TensorFlow [17]. For training and testing DaTNet-3 models, both data sets were split into training and test sets. From both the PPMI and STA data sets a test set of 200 images was randomly retrieved with equal allocation of classes. Training images of the STA and PPMI data sets were randomly augmented to increase the number of images, as machine learning models generalize better with larger data sets [19]. The STA image count was doubled to 932 images (550 NC and 382 DD) using horizontal flips and randomization of intensity and brightness. Training images of the PPMI data set were similarly augmented, but only the images in the NC class, to decrease the class imbalance. The PPMI data set size was increased to 2412 images (1240 NC and 1172 DD).
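The augmentation step can be sketched as follows. This is only an illustration under stated assumptions: the volume is a nested list indexed as volume[x][y][z], the first axis is taken to be the left-right axis, intensity randomization is modeled as a single multiplicative factor, and the 0.9 to 1.1 range and 50% flip chance are hypothetical values that are not given in the text.

```python
import random

Volume = list[list[list[float]]]  # assumed layout: volume[x][y][z]


def flip_left_right(volume: Volume) -> Volume:
    """Return a mirrored copy along the first (assumed left-right) axis."""
    return [[row[:] for row in plane] for plane in reversed(volume)]


def scale_intensity(volume: Volume, factor: float) -> Volume:
    """Return a copy with every voxel multiplied by the same factor."""
    return [[[value * factor for value in row] for row in plane] for plane in volume]


def augment(volume: Volume, rng: random.Random) -> Volume:
    """Create one augmented copy: optional flip plus random intensity scaling."""
    # Flip chance and scaling range are assumptions made for this sketch.
    flipped = flip_left_right(volume) if rng.random() < 0.5 else volume
    return scale_intensity(flipped, rng.uniform(0.9, 1.1))


# Toy 2 x 2 x 1 volume standing in for a preprocessed 17 x 23 x 20 image.
toy = [[[1.0], [2.0]], [[3.0], [4.0]]]
augmented = augment(toy, random.Random(0))
print(augmented)
```

Appending one augmented copy per training image doubles the size of a training set, and restricting the step to one class is a simple way to reduce class imbalance.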
Two DaTNet-3 models were generated and trained with different data input: DaTNet-3_STA was trained using only training images of the STA data set and DaTNet-3_PPMI was trained using only PPMI training images. The results of DaTNet-3_PPMI on the PPMI test set were used to benchmark our model against similar DaTSCAN classifying studies. The results of DaTNet-3_STA and DaTNet-3_PPMI on the STA test set were used to assess the usability in clinical practice and to investigate whether models trained on multi-site study data can work well on data obtained in routine practice.

DaTNet-3 model performance parameters
To evaluate the models, accuracy, sensitivity and specificity were calculated. To visualize these metrics, confusion matrices were plotted. Confusion matrices show performance with a fixed threshold set at 0.5, allowing easy interpretation of accuracy, sensitivity and specificity. Because the STA data set contains indeterminate scans, which show little decrease in DAT binding and are therefore hard to interpret by nuclear medicine physicians, classifications can be inconclusive or only suggestive of the presence or absence of a DD. To investigate the effect these scans have on the output, a probability threshold was implemented that filtered out indeterminate STA test images with an output probability between 0.2 and 0.8.
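The evaluation described above can be sketched in a few lines. The sketch assumes DD is the positive class (label 1) and NC the negative class (label 0), and that the 0.2 and 0.8 boundaries themselves are treated as uncertain; neither detail is stated explicitly in the text. Function names and the toy data are hypothetical.

```python
def confusion_counts(labels, probabilities, cutoff=0.5):
    """Count TP, FP, TN, FN with DD as the positive class (label 1)."""
    tp = fp = tn = fn = 0
    for label, p in zip(labels, probabilities):
        predicted = 1 if p >= cutoff else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity (DD recall) and specificity (NC recall)."""
    def ratio(num, den):
        return num / den if den else float("nan")
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
    }


def drop_uncertain(labels, probabilities, low=0.2, high=0.8):
    """Keep only cases whose output probability lies outside [low, high]."""
    kept = [(l, p) for l, p in zip(labels, probabilities) if p < low or p > high]
    return [l for l, _ in kept], [p for _, p in kept]


# Toy data for illustration only; not results from the study.
labels = [1, 1, 1, 0, 0, 0]
probabilities = [0.95, 0.60, 0.30, 0.05, 0.45, 0.70]
print(metrics(*confusion_counts(labels, probabilities)))
print(metrics(*confusion_counts(*drop_uncertain(labels, probabilities))))
```

In this toy example both misclassified cases fall inside the uncertain band, so accuracy rises after filtering. If errors were instead made with outputs near 0 or 1, filtering would leave accuracy nearly unchanged.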

Results
Two models (DaTNet-3_PPMI and DaTNet-3_STA) were trained with differing data set inputs and evaluated on their ability to correctly label DaTSCAN images based on the presence or absence of a DD.
The models were tested on two test sets: one containing STA images, for measuring the performance on data obtained in routine practice, and one containing PPMI images, for benchmarking against other DaTSCAN classifiers.
Performance evaluation PPMI / clinical data
Accuracy, sensitivity and specificity of the DaTNet-3_PPMI model on the PPMI test set were 98.5%, 100% and 97%, respectively. In Table 1 these results are compared with previous DaTSCAN classifying studies. In Table 2, the performance of the two models on the clinical test data set is shown, and confusion matrices are plotted in Fig. 2. The DaTNet-3_STA model performs slightly better (accuracy 89.0%) than the DaTNet-3_PPMI model (accuracy 84.5%). Furthermore, the accuracy of the DaTNet-3_STA model is compared to the published inter-observer agreement in Table 3.
To investigate whether the images (Fig. 2) that were misclassified have a lower probability, thresholding was implemented by filtering out images with a class probability between 0.2 and 0.8. The results are shown in Table 4. After thresholding, the accuracy of the DaTNet-3_STA model increased from 89.0% to 95.7%. This means that images with an uncertain class probability (0.2 to 0.8) have a higher likelihood of being misclassified. This increase was not seen for the PPMI-trained model, which shows very little increase in accuracy (from 84.5% to 85.8%).
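As a consistency check on the PPMI benchmark figures reported above, the confusion counts can be reconstructed from the rates. The sketch assumes that the 200-image test set with equal class allocation means 100 DD and 100 NC images; the function name is hypothetical.

```python
def counts_from_rates(n_dd, n_nc, sensitivity, specificity):
    """Reconstruct TP, FN, TN, FP from class sizes and reported rates."""
    tp = round(sensitivity * n_dd)   # DD images classified as DD
    tn = round(specificity * n_nc)   # NC images classified as NC
    return tp, n_dd - tp, tn, n_nc - tn


# Reported for DaTNet-3_PPMI on the PPMI test set: sensitivity 100%, specificity 97%.
tp, fn, tn, fp = counts_from_rates(100, 100, 1.00, 0.97)
accuracy = (tp + tn) / (tp + fn + tn + fp)
print(tp, fn, tn, fp, accuracy)
```

Under this assumption the reconstruction gives 100 true positives, 97 true negatives and 3 false positives, so the accuracy is 197/200 = 98.5%, in line with the reported value.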

Discussion
In this study we designed a CNN second reader for automatic classification of DaTSCAN images. The DaTNet-3_PPMI model was trained with multi-site study data of the PPMI data set, and the DaTNet-3_STA model with hospital-specific data obtained in routine practice. The study of [7] was outperformed by DaTNet-3. They used striatal binding ratio (SBR) values rather than the images themselves. These imaging features can be affected by changes in reconstruction and normalization steps [20,21], making that approach less robust than our DaTNet-3 CNN method. It is worth noting that some studies used 10-fold cross validation [4,7] and another study was hampered by a small testing sample size [5]. This could lead to an overestimation of the performance of these studies, which strengthens our conclusion that the DaTNet-3 model performs at least on par with, or even better than, these studies. Moreover, since our model is tested on a larger test set, and without 10-fold cross validation, we contribute to the evidence that AI is potentially feasible for classifying DaTSCAN images.
To study whether an AI model can be successfully trained with multi-site study data, we assessed the performance of DaTNet-3_PPMI and DaTNet-3_STA on STA images obtained in routine practice. The results in Table 2 show that DaTNet-3 trained with PPMI images performs slightly less accurately (84.5%) than DaTNet-3 trained with clinical STA images (89.0%). However, the DaTNet-3_PPMI model has a very high specificity (98%), which implies that 98% of the STA scans reported as having normal DAT binding, and consequently not supporting the clinical diagnosis of PD, are classified as normal by the DaTNet-3_PPMI model.
We further studied how classifying scans as uncertain, by thresholding the output probability, affects the performance of both models. The clinically trained model, DaTNet-3_STA, showed a large increase in accuracy (from 89.0% to 95.7%). This indicates that a substantial share of the incorrect classifications had output probabilities within the uncertain 0.2-0.8 range. These results underline the importance of training with data that contains more indeterminate scans, as this allowed the model to get better insights into such images.
The lack of accuracy increase using DaTNet-3_PPMI (from 84.5% to 85.8%) suggests that the model trained with PPMI images is incapable of classifying the indeterminate STA images with uncertain probability.
Rather, its incorrect classifications are made confidently (meaning with an output probability under 0.2 or above 0.8) as NC or DD. The difference is probably explained by the fact that the PPMI data set merely contains manifest PD and normal images, in contrast to the STA data set, which also contains other parkinsonian syndromes and indeterminate images. Therefore, we conclude that the usability of a multi-site study data set as a training data set can improve if indeterminate images are included as well, which is common practice in routine clinical studies.
The performance of DaTNet-3 can be compared to the published inter-observer agreement of DaTSCAN studies by Tondeur et al. [2] and Booij et al. [3] (37-100% and 0.74-0.93 inter-observer agreement/κ coefficients for reader pairs, respectively). This comparison may indicate that DaTNet-3 shows high agreement with the interpretation made by the nuclear medicine physicians. Therefore DaTNet-3 is potentially valuable as a second reader for reading DaTSCAN images in routine practice. Also, both models have a very high specificity for classifying images as being normal; a potential application of such an AI tool could therefore be to detect normal non-deficit images with high confidence.
In spite of the high accuracy, DaTNet-3 does not take patient age into account, which is useful for differentiating age-related from parkinsonian-related dopaminergic depletion [22]. Future studies could include patient's age as an extra parameter for model input to further improve its accuracy.
Furthermore, it should be noted that all our testing and training data originate from general-purpose SPECT cameras. Images from dedicated brain scanners, like the InSPira HD SPECT system [23], might lead to diagnoses that are based on detailed local diminished uptake, as can be seen with such high-resolution systems. Preliminary results (data not shown) of our model using a test set of DaTSCAN images acquired on the InSPira HD SPECT system indeed showed that the model did not generalize well enough for data from a dedicated brain scanner. Therefore, in future studies we propose to set up a multi-site study data set with data from both general and brain-dedicated systems to evaluate whether the model can also perform accurately on data obtained on brain-dedicated systems.
Currently, one of the largest hurdles in AI is retrieving usable historic data from usually unstructured sources [24], and much time was spent on creating processed and correctly labeled training data. In this study, detailed and non-subjective DaTSCAN interpretation reports were lacking, making it difficult to precisely classify images. In the future, the use of a predefined five-stage degeneration scale [25] stored in a structured format could be considered to improve model input.

Conclusions
A CNN model, DaTNet-3, was designed and used for the automatic interpretation of DaTSCAN images. Based on the results of our model we conclude that automatic interpretation of DaTSCAN images with AI is feasible. DaTNet-3 may show great potential in increasing diagnostic confidence. By acting as an automated second reader, it can give nuclear medicine physicians more confidence in their diagnosis without the need for complicated image feature selection tools. Further, we conclude that our model performs slightly better when DaTSCAN images from clinical practice are used for training. This difference increased when the effect of indeterminate images on classification was investigated through output probability thresholding. This allowed insights into the probability of images that could be incorrect classifications. Therefore, we conclude that the usability of a PPMI study data set as an AI training set can improve if indeterminate images are included.

All other authors declare that they have no conflict of interest.

Availability of data and material
In-house STA DaTSCAN images are confidential and not available.