2.1 Patient population
In this study, 204 subjects (118 men, 86 women, 61.6±12.9 years, range 20-90 years), were referred to the Center for Nuclear Medicine & PET/CT PositronMed, Santiago, Chile, for PET/CT imaging with [18F]PR04.MZ to evaluate nigrostriatal neuronal integrity. The compiled database was split into two groups: 129 PET scans (63%) destined to train the models and 75 (37%) to test them.
All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and national research committee and with the principles of the 1964 Declaration of Helsinki and its later amendments or comparable ethical standards. This is a retrospective analysis of previously acquired PET data approved by the regional ethics committee board (CEC SSM Oriente, permit 20140520) and written informed consent has been obtained from all participants.
2.2 PET Imaging acquisition.
All images were acquired on a state-of-the-art WB PET/CT scanner (Biograph mCT Flow, Siemens Healthineers, Erlangen, Germany). Patients were allowed to rest for 55 min after injection of 4.99±0.82 mCi [18F]PR04.MZ (184.0±30.0 MBq, range: 94.7-234.6 MBq) before being placed head-first supine in the PET/CT scanner. A low-dose CT was performed for attenuation correction before the emission scan was performed in LIST-mode from 60-90 min post-injection (p.i.). PET images were corrected for random, scatter, attenuation, and time-of-flight, reconstructed by an ordered subset expectation maximization (OSEM) algorithm (2 iterations and 21 subsets) followed by post-reconstruction smoothing (Gaussian, 4 mm FWHM), corrected for motion if necessary, and averaged, following the manufacturer’s instructions (Siemens). Images consisted of 110 planes of 256x256 voxels of 1.59x1.59x1.5 mm3. PET scans were reframed into 6 static frames of 300 sec each, reviewed, corrected for motion and averaged for the period of 60-90 min post-injections. PMOD software (Version 3.4, Zurich, Switzerland: PMOD Technologies LLC. Available from: https://www.pmod.com/) was used to co-registered images to CT scans and to normalize to Montreal Neurological Institute (MNI) space. Volumes of interest (VOI) were outlined for the left and right anterior, posterior, and total putamen, caudate nucleus, SNpc, and cerebellum. Specific binding ratios (SBR) were calculated from SUVmean values as follows:
2.3 General workflow
PET imaging technique with [18F]PR04.MZ tracer implemented to acquire a PET image dataset. Later, three independent expert readers assessed and labeled two-hundred-four PET images. The images were preprocessed to generate 2D and 3D datasets for the subsequent training and analysis of five different ML models. Final statistical analysis was performed to evaluate accuracy, precision, recall, F1-score and AUC (Figure 1).
2.4 Classification of PET images
All PET images (axial and coronal views) and quantitative signal-to-background-ratio (SBR) values (Figure 2) were assessed and classified by three blinded expert readers. A deficit of >40% (eq. to 2 x SD) from the mean value of the healthy control group in any region was considered pathological.
Each scan was assigned as follows: normal (0) or abnormal (1), showing absence (0) or presence (1) of asymmetric binding, absence (0) or presence (1) of a rostro-caudal gradient, and being compatible (1) or not (0) with PS. The final label was assigned to a scan without evidence for dopaminergic deficit (0) or a scan showing dopamine deficiency compatible with PS, based on the agreement of at least two readers.
2.5 Imaging preprocessing and feature selection: 2D and 3D datasets
The data preprocessing pipeline aims to submit two distinct dataset dimensions, therefore, each input was processed differently (Figure 3). For the 2D dataset, the preprocessing stage focused on selecting the region of interest (ROI), targeting the striatal and midbrain regions within each image. To achieve this, images underwent initial cropping and resizing to homogenize the input data. As a result, reduced pseudo-color images with 80 x 40 resolution were obtained (Figure S1).
The preprocessing of the 3D dataset focused on the volume of interest (VOI), including 10 consecutive 2D images, comprising the striatal and midbrain regions. The unsupervised model applied to achieve this selection is described as follows:
1. From the acquired 3D image, 45% and 30% were removed from the upper and lower slices, respectively. This standardized inner volume was used as a first filter to select the VOI (Figure 4).
2. Density-based spatial clustering of applications with noise (DBSCAN) is a technique of cluster building around an arbitrary initial point. The parameter ‘eps’ marks the neighborhood radius of the point to analyze if it has ‘min samples’, the minimum number of points that the cluster can contain; if not, it’s considered as noise. The sklearn.cluster.DBSCAN function was used to select the slice with the highest density as the central slice of the output-optimized brain box. Therefore, the cluster show the tracer uptake in the target region of interest (ROI), i.e. caudate nucleus, anterior and posterior putamen of both left and right brain hemispheres (Figure 5).
3. Finally, a set of ten slices were selected to wrap the slice with the highest density, resulting in a reduced resolution scan output, as illustrated in Figure 6.
The success of this process allowed the reduction of the input images for the machine learning classification, before splitting them into the training/validation/test set for the supervised learning classification.
2.6 Machine learning model pipeline
After the dataset labeling and image preprocessing, 63% of the images were exclusively used to build a trained model using ML. To achieve our goal we compared five different classifiers (SVM, random forest, logistic Regression, K-NN and neural network) and systematically performed a 10-fold cross-validation method (Figure 7).
All classifiers' performance were evaluated on a set of the remaining images (37%) using the labels assigned by expert readers (Step 6).
2.6.1 2D-dataset
The 2D dataset was divided as described in Table 1. A total of 129 2D images were assigned to the training set, while the remaining 75 were for testing.
Table 1: Group composition and demographic details
|
SWEDD Training
|
PS
Training
|
SWEDD Testing
|
PS
Testing
|
Amount
|
54
|
75
|
39
|
36
|
Age
|
63.3±13.9
|
59.0±13.9
|
60.9±11.5
|
65.3±9.5
|
During the preprocessing stage (Figure 3), the ROI was selected and, prior to resizing the scans to a lower resolution of 40 x 80, the margins of each pseudo-color image were removed. This step allowed the input homogenization before running the algorithm.
To avoid overfitting, the training process systematically incorporated the 10-fold cross-validation technique, where the training dataset was splitted into 90% to train the model and 10% to validate its performance (Figure 7).
For training of the preprocessed images (step 5 in Figure 1), Python 3.8.2 scikit-learn functions (http://scikit-learn.org) were used to implement the ML models. To determine the optimal model, GridSearchCV was used to find the best parameter grid across all possible parameter value combinations. The selected parameters are shown in Table 2.
Table 2: Best parameter selection for each machine learning model using 2D dataset.
Model
|
GridSearchCV selection
|
SVM
|
SVC(C=1, kernel='linear', probability=True, random_state=21)
|
RF
|
RandomForestClassifier(criterion='entropy', random_state=42)
|
LR
|
LogisticRegression(C=5, penalty='l1', random_state=21, solver='liblinear')
|
K-nn
|
KNeighborsClassifier(n_neighbors=1)
|
TensorFlow keras (https://www.tensorflow.org/ guide/keras/sequential_model) was used to implement the neural network. The neural network was composed of three convolutional layers, each one with a 3x3 kernel layer, a ReLU activation function, followed by one max pooling layer size of 2x2 and one flatten with activation ‘relu’. The classification layer included a dense layer, activation function ‘sigmoid’, that allows full connectivity between neurons in preceding and succeeding layers. Finally, we compiled the model using a ‘binary_crossentropy’ loss, ‘adam’ optimizer and measured its performance with metric ‘accuracy’.
Each of the models were iteratively executed and validated, comparing their performance in terms of precision, recall, F1- score and accuracy (Figure 7).
2.6.3 3D-dataset
The results of the preprocessed 3D dataset were submitted as input for the supervised classifier. This 3D array was flattened into a pixel vector. Similar to the approach used for the 2D dataset, all of the machine learning models applied to the 3D-dataset employed the same hyper-parameter optimization grid search technique (Table 3). Also, a 10-fold cross-validation technique was used to prevent overfitting and create a generalized model, capable of adapting to unseen data. The models were later compared using the previously mentioned indicators: precision, recall, F1- score and accuracy.
Table 3: Best parameter selection for each machine learning model using 3D dataset.
Model
|
GridSearchCV selection
|
SVM
|
SVC(C=1, kernel='linear', probability=True, random_state=0)
|
RF
|
RandomForestClassifier(n_estimators=45, random_state=0)
|
LR
|
LogisticRegression(C=1, penalty='l1', random_state=0, solver='liblinear')
|
K-nn
|
KNeighborsClassifier(n_neighbors=2, weights='distance')
|
To select the best neural network combination of hyperparameters, the Ax Adaptive Experimentation Platform (https://ax.dev), which simplifies the search for an optimal neural network configuration, was used. The metric defined in the network compiler seeks a ‘binary accuracy’ optimization. AX library.get_next_trial() allows to iteratively create a neural network with a combination of parameters for 25 experimental trials. Finally, ax client.get_best parameters select the best set of parameters (Table 4).
Table 4: Summary of the best set of parameters for neural network binary accuracy optimization
Hyperparameter
|
Best combination
|
`learning rate'
|
0.00013850114679874224
|
`dropout rate'
|
0.010574775359670174
|
`num hidden layers'
|
5
|
`neurons per layer'
|
17
|
`batch size'
|
8
|
`activation'
|
'tanh'
|
`optimizer'
|
‘rms’
|
`keras cv'
|
1.0000164359863488
|
To address overfitting, Ax platform was set to assign a ‘learning_rate’ within the range [0.0001, 0.001], and a ‘dropout_rate’ within [0.01, 0.05]. According to Table 4, 0.013% is the best learning rate to get the minimum loss function, also meaning more epochs and computational resources will be needed, while 1.06% dropout rate is the percentage of neurons dropped between the three hidden layers. Also, ‘EarlyStopping’ (https://keras.io/api/callbacks /early_stopping) with (patience=20) was included in the models’ callback, which prevents it from continuing to train once it has stopped improving.
Accuracy and loss value metrics were used to select the best model. Graphing both indicators allowed the comparison of the model performance with the training and validation data after each iteration. Accuracy facilitates the visualization of how precisely it is to classify the pattern of an image; meanwhile, loss value indicates the error that adds each training epoch. The convergence behavior of both graphs issues the best model for later classification of the test set. After setting the model to train for 500 epochs, and configuring early stopping (patience=20), the algorithm converged as shown in Figure 8, and was ready for final classification.
2.7 Statistical analysis
The inter-observer reliability between the clinicians and the image-based model identification was calculated using Cohen’s exact Kappa (κ) [54]. We considered the level of agreement according to the following value ranges for κ: slight (0.00 to 0.20), fair (0.21 to 0.40), moderate (0.41 to 0.60), substantial (0.61 to 0.80), and almost perfect (0.81 to 1.00) [56]. Calculations were performed using 95% confidence intervals (CI’s), and p<0.05 were considered significant. The statistical analysis was performed using R ”IRR”–CRAN, package 3.5.1 version (http://www.R-project.org).
To analyze the performance of the five different models, accuracy (1), precision (2), recall (3) and F1-score (4) were evaluated using the information for true positive (TP), true negative (TN), false positive (FP), and false negative (FN) and according to the following equations:
Where accuracy measures the percentage of samples the model has correctly classified, precision is the quality of the ML model in classification tasks rated according to the number of TP in a total of PS predictions, recall reports the number of PS subjects the ML model can identify and F-score measuring the performance of precision and recall between models.
In addition to the metrics mentioned above, Receiver Operating Characteristics (ROC) curves were also displayed. By showing the True Positive Rate (TPR) and False Positive Rate (FPR), ROC curves help to evaluate the performance of each classifier. That’s mainly why ROCs are commonly useful in binary classification problems. In an illustration, the “perfect” ROC curve is the one closer to the top-left corner of the image, since the ideal scenario implies having high TPR and no FPR.
The Area Under the curve (AUC), in this case the ROC curve, rates on a scale from 0 to 1 the value of the area covered by the curve. This metric, along with other indicators, contribute to the comparison between models’ performance.