Deep Learning-Based Fast TOF-PET Image Reconstruction Using Direction Information

Deep learning has attracted attention for the positron emission tomography (PET) image reconstruction task; however, it remains necessary to further improve the image quality. In this study, we propose a novel CNN-based fast time-of-flight PET (TOF-PET) image reconstruction method that fully utilizes the direction information of coincidence events. The proposed method inputs view-grouped histo-images into a 3D CNN as a multi-channel image to use the direction information of coincidence events. We evaluated the proposed method using Monte Carlo simulation data obtained from a digital brain phantom. Compared with the case without direction information, the peak signal-to-noise ratio and structural similarity were improved by 1.2 dB and 0.02, respectively, at a coincidence time resolution of 300 ps. The calculation time of the proposed method was significantly shorter than that of conventional iterative reconstruction. These results indicate that the proposed method improves both the speed and image quality of TOF-PET image reconstruction.


Introduction
Positron emission tomography (PET) is a functional imaging tool for various medical applications, such as oncology, cardiology, and neurology [1]. It has a unique ability to quantitatively estimate radiotracer concentrations as low as picomolar levels; however, the radiotracer concentration cannot be directly imaged from a line of response measured by the coincidence detection of annihilation photons. Therefore, an image reconstruction process is required to estimate the distribution of the radiotracer.

The most likely annihilation point $\vec{x}_0$ is estimated from the detection positions $\vec{x}_1$ and $\vec{x}_2$ of the coincidence event and the TOF measurement as

$$\vec{x}_0 = \frac{\vec{x}_1 + \vec{x}_2}{2} + \frac{c\,\Delta t}{2} \cdot \frac{\vec{x}_2 - \vec{x}_1}{\left|\vec{x}_2 - \vec{x}_1\right|},$$

where $c$ and $\Delta t$ denote the speed of light and the TOF information, respectively. The MLAP method simply accumulates the event in the voxel nearest to $\vec{x}_0$. The CW method accumulates the event as a line weighted by the TOF response function centered on $\vec{x}_0$, as illustrated in Figure 1.

By mathematically modeling the variance of the voxel values of an analytically reconstructed image of a uniform disk phantom, previous work concluded that the CW method is optimal in terms of SNR [23,24]. The MLAP method is not optimal because the high-resolution information in the vertical direction (Figure 1(b)) of the coincidence event is lost after the events are accumulated into the histo-image. In other words, the direction of the coincidence event contains information about resolution heterogeneity. Another reason to choose the CW method is continuity with conventional non-TOF PET image reconstruction: the CW method tends to the non-TOF reconstruction method as the coincidence time resolution (CTR) increases to infinity.
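To make the two accumulation schemes concrete, here is a minimal NumPy sketch of MLAP and CW deposition for a single event into a 2D histo-image. The grid handling, the Gaussian model of the TOF response function, and the helper names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

C_MM_PER_S = 2.998e11  # speed of light [mm/s]

def tof_position(x1, x2, dt):
    """Most likely annihilation point from LOR endpoints and TOF difference dt [s]."""
    x1, x2 = np.asarray(x1, float), np.asarray(x2, float)
    u = (x2 - x1) / np.linalg.norm(x2 - x1)            # unit vector along the LOR
    return (x1 + x2) / 2.0 + 0.5 * C_MM_PER_S * dt * u

def mlap_deposit(img, x0, origin, vox):
    """MLAP: the whole event goes into the voxel nearest to x0."""
    i, j = np.round((np.asarray(x0) - origin) / vox).astype(int)
    if 0 <= i < img.shape[0] and 0 <= j < img.shape[1]:
        img[i, j] += 1.0

def cw_deposit(img, x1, x2, x0, sigma, origin, vox, step=1.0):
    """CW: spread the event along the LOR, weighted by a Gaussian TOF kernel."""
    x1, x2 = np.asarray(x1, float), np.asarray(x2, float)
    x0 = np.asarray(x0, float)
    n = max(2, int(np.linalg.norm(x2 - x1) / step))
    pts = x1 + np.linspace(0.0, 1.0, n)[:, None] * (x2 - x1)  # samples on the LOR
    w = np.exp(-np.sum((pts - x0) ** 2, axis=1) / (2.0 * sigma ** 2))
    w /= w.sum()                                       # unit total weight per event
    for p, wk in zip(pts, w):
        i, j = np.round((p - origin) / vox).astype(int)
        if 0 <= i < img.shape[0] and 0 <= j < img.shape[1]:
            img[i, j] += wk
```

The sketch makes the cost difference visible: MLAP touches one voxel per event, whereas CW visits every sample along the line, which is the ray-tracing cost discussed below.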

In this study, we adopted angular-view grouping [22,25] to introduce direction information into CNN-based TOF-PET image reconstruction.
Although the CW method is optimal from the perspective of SNR, it is time-consuming because it requires ray tracing. Angular-view grouping [22,25] provides the direction information without this cost. In this scheme, the events are divided into N groups depending on the azimuthal angle of the coincidence, and each group is accumulated separately into one of N histo-images using the MLAP method. Figure 2 shows an example of angular-view grouping with N = 8. Using angular-view grouping, we can preserve the direction information of coincidence events as view-grouped histo-images without ray tracing. Note that the angular-view grouping in this study is performed in the azimuthal angle, not in the oblique angle.

The azimuthal angle $\phi$ and the view group $v$ can be calculated with the following equations:

$$\phi = \tan^{-1}\!\left(\frac{y_2 - y_1}{x_2 - x_1}\right), \qquad v = \left\lfloor \frac{\phi}{\pi / N} \right\rfloor,$$

where $(x_1, y_1)$ and $(x_2, y_2)$ are the 2D coordinates of $\vec{x}_1$ and $\vec{x}_2$, respectively, and $\lfloor\cdot\rfloor$ denotes the round-down (floor) operator.
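As an illustration of this grouping rule, the sketch below computes the view index from the LOR endpoints and accumulates events into N MLAP histo-images; the event container and the nearest-voxel deposit are simplified assumptions.

```python
import numpy as np

def view_group(x1, y1, x2, y2, n_views):
    """Azimuthal view index of a coincidence event; the oblique angle is ignored."""
    phi = np.arctan2(y2 - y1, x2 - x1) % np.pi  # map the LOR angle to [0, pi)
    return int(phi // (np.pi / n_views))        # round down into one of N bins

def make_view_grouped_histo(events, n_views, shape, origin, vox):
    """Accumulate events into N MLAP histo-images (one channel per view)."""
    histo = np.zeros((n_views,) + shape)
    for (x1, y1), (x2, y2), x0 in events:       # x0: TOF annihilation point
        v = view_group(x1, y1, x2, y2, n_views)
        idx = tuple(np.round((np.asarray(x0) - origin) / vox).astype(int))
        if all(0 <= k < s for k, s in zip(idx, histo.shape[1:])):
            histo[(v,) + idx] += 1.0            # MLAP: nearest-voxel deposit
    return histo
```

The resulting (N, z, y, x) array is exactly the multi-channel input described for the 3D CNN below.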
Network architecture

The encoder part extracts useful features for image reconstruction through convolution, non-linear activation, and down-sampling. The combination of a 3 × 3 × 3 3D convolution and a leaky rectified linear unit (LReLU) is repeated twice before down-sampling. Down-sampling is performed by a 3 × 4 × 4 3D convolution with stride (1, 2, 2), followed by LReLU. At each down-sampling stage, the x and y sizes of the feature maps are halved, and the number of channels is doubled.

The decoder part reconstructs the final image from the feature maps through convolution, non-linear activation, and up-sampling. The combination of a 3 × 3 × 3 3D convolution and LReLU is repeated twice after up-sampling, which is performed by a 3 × 4 × 4 3D transpose convolution with stride (1, 2, 2), followed by LReLU. At each up-sampling stage, the x and y sizes of the feature maps are doubled, and the number of channels is halved. The final image is reconstructed by a 3 × 3 × 3 3D convolution with a one-channel output.

The feature maps of the encoder part before down-sampling are added to the feature maps of the decoder part after up-sampling through skip connections.
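This description maps naturally onto a U-Net-style 3D CNN. The PyTorch sketch below shows a single down/up-sampling stage; the number of stages, the base channel width (32), and the LReLU slope (0.2) are assumptions, since the text does not specify them. The input stacks the N view-grouped histo-images with the attenuation map as channels, following the discussion later in the paper.

```python
import torch.nn as nn

def conv_block(ch_in, ch_out):
    """Two 3x3x3 conv + LReLU layers, as in each encoder/decoder stage."""
    return nn.Sequential(
        nn.Conv3d(ch_in, ch_out, 3, padding=1), nn.LeakyReLU(0.2),
        nn.Conv3d(ch_out, ch_out, 3, padding=1), nn.LeakyReLU(0.2),
    )

class ViewGroupedUNet3D(nn.Module):
    """Minimal two-scale sketch of the encoder-decoder described above."""
    def __init__(self, n_views=8, base=32):
        super().__init__()
        self.enc1 = conv_block(n_views + 1, base)   # +1 channel: attenuation map
        self.down = nn.Sequential(                  # halve x and y, keep z
            nn.Conv3d(base, base * 2, (3, 4, 4), stride=(1, 2, 2),
                      padding=(1, 1, 1)), nn.LeakyReLU(0.2))
        self.enc2 = conv_block(base * 2, base * 2)
        self.up = nn.Sequential(                    # double x and y, keep z
            nn.ConvTranspose3d(base * 2, base, (3, 4, 4), stride=(1, 2, 2),
                               padding=(1, 1, 1)), nn.LeakyReLU(0.2))
        self.dec1 = conv_block(base, base)
        self.out = nn.Conv3d(base, 1, 3, padding=1)  # one-channel final image

    def forward(self, x):                # x: (batch, n_views + 1, z, y, x)
        s1 = self.enc1(x)
        x = self.enc2(self.down(s1))
        x = self.up(x) + s1              # skip connection by addition, per the text
        return self.out(self.dec1(x))
```

The (3, 4, 4) kernel with stride (1, 2, 2) and padding (1, 1, 1) halves (or, transposed, doubles) only the x and y sizes while leaving the axial size unchanged, matching the stated behavior.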
Experimental setup
In this study, we used the brain-dedicated PET scanner described in [28] as the detector arrangement for the simulation. A detector ring with a diameter of 486.83 mm was constructed with 28 detector units in the ring direction and 4 in the axial direction. Each detector unit had a 16 × 16 array of cerium-doped lutetium-yttrium oxyorthosilicate (LYSO) crystals; the size of each LYSO crystal was 3.14 mm × 3.14 mm × 20 mm. The image size was 70 × 128 × 128 voxels with a voxel size of 3.221 × 3 × 3 mm³. An energy resolution of 15% and an energy window of 400-650 keV were assumed. A total of 181.12 ± 6.08 M counts, including scatter events, were collected for each subject using 3D acquisition. CTR values of 100, 300, and 600 ps were simulated. The number of rings was 72, including the gaps between the detector units in the axial direction. The maximum ring difference was set to ±66.

We split the 20 subjects into 15 for training and 5 for testing. In addition, the training data were split into 12 subjects for actual training and 3 for validation to monitor the validation loss during training.
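A minimal sketch of the subject-level split described above; the subject identifiers and the fixed random seed are placeholders.

```python
import random

subjects = [f"subject_{i:02d}" for i in range(20)]  # 20 simulated subjects
rng = random.Random(42)                             # fixed seed for a reproducible split
rng.shuffle(subjects)

test = subjects[:5]    # 5 subjects held out for testing
val = subjects[5:8]    # 3 of the remaining 15 for validation
train = subjects[8:]   # 12 subjects for actual training
```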

Network training
We trained the 3D CNN for 500 epochs using the Adam optimizer with $\beta_1 = 0.5$. The phantom images were used as the training labels, and the mean squared error was used as the loss function. We considered 64 updates using mini-batches.

Results and discussion

The required number of views N can be estimated from $\sigma$, the standard deviation of the TOF response function, and $d_0$, the required spatial resolution. If $\sigma$ = 6.37 mm, which corresponds to a CTR of 100 ps, and $d_0$ = 4.5 mm, which corresponds to a width of 1.5 voxels, then N > 6 is sufficient. The above results are consistent with this theory. Therefore, the optimal number of views for each TOF-PET scanner can be easily estimated from its CTR and the required spatial resolution.

These results indicate that the spatial resolution of the image reconstructed by the proposed method was improved by the high-resolution information in the vertical direction of coincidence events.
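The quoted $\sigma$ follows directly from the CTR, assuming a Gaussian TOF response function whose positional FWHM is $c \cdot \mathrm{CTR} / 2$:

```python
C_MM_PER_S = 2.998e11  # speed of light [mm/s]

def tof_sigma_mm(ctr_ps):
    """Std. dev. of the TOF response function along the LOR [mm]."""
    fwhm = C_MM_PER_S * ctr_ps * 1e-12 / 2.0  # timing spread -> positional FWHM
    return fwhm / 2.355                       # Gaussian FWHM -> sigma

for ctr in (100, 300, 600):                   # the three simulated CTRs
    print(f"CTR {ctr:3d} ps -> sigma = {tof_sigma_mm(ctr):5.2f} mm")
# CTR 100 ps -> sigma = 6.37 mm, matching the value quoted above
```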
Table 1 compares the calculation times, PSNRs, and SSIMs of the proposed method and the other methods at a CTR of 300 ps. The calculation time of the proposed method is almost the same as that of FastPET. In addition, the proposed method was three orders of magnitude faster than List-DRAMA. These results indicate that the proposed method is capable of near real-time TOF-PET image reconstruction with high image quality.

One limitation of this study is that the method was applied only to a simulation dataset; we will collect experimental data for training the neural networks using the brain PET scanner. This study also suggests that findings from image reconstruction theory are useful for deep learning-based methods. For example, view-grouped histo-images could be beneficial for improving PET image quality in unsupervised CNN frameworks [6-8,12].

In this study, we used angular-view grouping with MLAP instead of the CW method. The CW method is expected to improve the SNR from the principles of TOF-PET image reconstruction; however, it is impractical from the perspective of calculation cost because a strict implementation of CW requires event-by-event ray tracing. Angular-view grouping can be considered a fast approximation of the CW method, and its performance approaches that of the CW method as the number of views increases. Similar to FastPET [21], we input the tuple of view-grouped histo-images and attenuation maps to the 3D CNN. If no attenuation map were used as a CNN input, the accuracy of the scatter and attenuation corrections could be degraded, because Compton scattering and photon attenuation are governed by the attenuation map.
Conclusions
We proposed a deep learning-based fast TOF-PET image reconstruction method that uses direction information. We input view-grouped histo-images to a 3D CNN to use the direction information. We evaluated the proposed method using Monte Carlo simulation data from a digital brain phantom. The proposed method achieved better PSNR and SSIM results, recovered finer structures than the other methods, and required a sub-second calculation time. These results indicate that the proposed method benefits both the speed and image quality of TOF-PET image reconstruction.