Performance evaluation in the reconstruction of 2D images of computed tomography using massively parallel programming CUDA

This work analyzes the processing time and the similarity of the images generated across CPU and GPU architectures using sequential and parallel programming. Image processing was performed on a computer with an AMD FX-8350 processor and an Nvidia GTX 960 Maxwell GPU, using the CUDAFY library and the C# programming language in the Visual Studio IDE. The comparisons indicate that sequential programming on the CPU generates reliable images at a high cost in time when compared with parallel programming on the CPU and GPU, while parallel programming generates results faster but with increased noise in the reconstructed image. For the float data type, the GPU had the best average time, equivalent to approximately 1/3 of the parallel CPU time, whereas for the double data type the parallel CPU approach had the best average time. Regarding image quality, the sequential approach reproduced the same outputs, while the parallel approaches generated noise in their outputs.

to create non-invasive and highly accurate examinations for patients. In 1972, engineer Godfrey Hounsfield created a computed tomography scanner. He believed that radiographs could contain more information than could be captured on film.
Computed tomography is a high-precision, non-invasive imaging exam that provides better quality of patient care. A CT scan uses x-rays for diagnosis, and the images are formed from the attenuation that the x-rays undergo when crossing a given body [18] [19].
One of the ways to create 3D computed tomography images is through the interpolation of a stack of 2D images, which requires processing time to reconstruct each image [21].
The present study aims to find the method with the highest performance in delivering computed tomography exam results, which benefits the patient through a shorter waiting time. It is noteworthy that the techniques applied to reduce the 2D image reconstruction time generated noise.
The study is therefore limited to the 2D image reconstruction process, focusing on the comparison of image reconstruction times and on the analysis of the quality of these images across three approaches, which are presented in the methodology section.

Related Works
In [18], the author demonstrates that the filtered backprojection algorithm for the reconstruction of 2D tomography images can be massively parallelized and presents a benchmark between an Intel Paragon supercomputer and a Connection Machine CM-5, with a dataset based on one CT image. The algorithm was analyzed for efficiency and speed on the Intel Paragon and the CM-5. The execution times obtained from the parallelization indicate that, at least in the 2D case, the Intel Paragon delivered better overall acceleration and efficiency than the CM-5.
arXiv:2109.02174v1 [physics.med-ph] 5 Sep 2021
In [19], the reconstruction of 2D tomography images was shown to require massive parallelization of the algorithm for performance gains. The authors proposed a hybrid approach with parallelization on both the CPU and the GPU, with a dataset based on cucumber phantoms. The results show that the 980 Maxwell GPU used in the study achieved a performance gain of approximately 5 times, demonstrating the feasibility of using GPUs for the reconstruction of electrical impedance tomography images.
In [17], 3D tomography images were reconstructed using a massively parallel approach on the GPU and a sequential approach on the CPU, from the 2D sinograms of three different CT images. Two NVIDIA GPUs, of the Tesla and Fermi architectures, were used, and both the quality of the generated images and the performance were analyzed. An acceleration factor between 15 and 85 times was observed when compared with the sequential CPU approach. The Fermi and Tesla GPUs were also compared, as was the quality of the generated images. Increased noise was observed in the images generated by the massively parallel approaches, due to race conditions in the threads' read-modify-write process, which makes the use of atomic operations necessary.

Computed tomography
Computed tomography is performed using Johann Radon's algorithm, the Radon transform, whose output is also known as a sinogram. It aims to reconstruct images of sections or slices of a body from measurements of the attenuation that the x-ray undergoes when crossing the body at a certain angle θ, as illustrated in [[13], Figure 1].
The law that describes the physical interaction of an x-ray beam with matter is the Lambert-Beer law, according to [11], and is given in equation 1.
$$I = I_0 \, e^{-\mu \cdot d} \qquad (1)$$
Here I is the intensity of the beam after passing through the body, I_0 is the initial beam intensity and (-µ · d) is the negated product of the attenuation coefficient (µ) and the thickness of the body (d) crossed by the beam, as illustrated in [[14], Figure 2 (adapted)]. Thus, when performing a CT scan, it is possible to obtain a matrix with the attenuation values of the x-ray beams in a two-dimensional space f(x, y), representing slices of an object.
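As a small worked example of the Lambert-Beer law, the sketch below computes the transmitted intensity for one beam and then inverts the law to recover the product µ · d, which is the quantity a single ray contributes to the projection data. The function names and the numeric values (I_0 = 1000, µ = 0.2 per cm, d = 5 cm) are illustrative assumptions, not values from this study.

```python
import math

def transmitted_intensity(i0, mu, d):
    """Lambert-Beer law: I = I0 * exp(-mu * d)."""
    return i0 * math.exp(-mu * d)

def attenuation_along_ray(i0, i):
    """Invert the law: mu * d = -ln(I / I0)."""
    return -math.log(i / i0)

# Hypothetical beam: initial intensity, attenuation coefficient (1/cm), thickness (cm).
i0, mu, d = 1000.0, 0.2, 5.0

i = transmitted_intensity(i0, mu, d)
p = attenuation_along_ray(i0, i)

print("transmitted intensity:", i)
print("recovered mu * d:", p)
```

For a body with varying attenuation, the same inversion yields the line integral of µ along the ray, which is what each sinogram sample stores.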
Since the x-ray travels in a straight line, as can be observed in [[8], Figure 3], each line integral is a function of f(x, y) and is parameterized by (θ, t). Each x-ray beam can therefore be described by the equation x · cos(θ) + y · sin(θ) = t.
Applying the filtered backprojection (FBP) algorithm and the inverse Radon transform yields equation 2, whose filtered term represents the convolution of the filter with the original projection. The reconstruction algorithm for 2D computed tomography images uses this equation.
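To make the filtered backprojection idea concrete, the following is a minimal sequential sketch in Python with NumPy. It is not the C#/CUDAFY implementation used in this study. It assumes the textbook form of the method: each projection is filtered with a ramp filter |w| in the frequency domain, and the filtered values are then accumulated along the lines x · cos(θ) + y · sin(θ) = t. The function names, the unit detector spacing, the linear interpolation and the disc test object are all assumptions made for illustration.

```python
import numpy as np

def ramp_filter(sinogram):
    """Multiply each projection (one row per angle) by the ramp |w| in the frequency domain."""
    n_bins = sinogram.shape[1]
    freqs = np.fft.fftfreq(n_bins)  # cycles per detector bin (unit bin spacing assumed)
    spectrum = np.fft.fft(sinogram, axis=1) * np.abs(freqs)
    return np.real(np.fft.ifft(spectrum, axis=1))

def filtered_backprojection(sinogram, thetas):
    """Reconstruct an n x n image from sinogram[k, j] measured at angle thetas[k]."""
    n_angles, n_bins = sinogram.shape
    filtered = ramp_filter(sinogram)
    centre = (n_bins - 1) / 2.0
    coords = np.arange(n_bins) - centre
    x, y = np.meshgrid(coords, coords)
    bins = np.arange(n_bins)
    image = np.zeros((n_bins, n_bins))
    for row, theta in zip(filtered, thetas):
        # Detector coordinate t = x cos(theta) + y sin(theta) seen by each pixel.
        t = x * np.cos(theta) + y * np.sin(theta) + centre
        image += np.interp(t.ravel(), bins, row, left=0.0, right=0.0).reshape(x.shape)
    return image * np.pi / n_angles  # approximates the integral over theta in [0, pi)

# Hypothetical test object: a centred disc of radius 16 and attenuation 1,
# whose projection 2 * sqrt(R^2 - t^2) is the same at every angle.
n_bins, radius = 128, 16.0
t = np.arange(n_bins) - (n_bins - 1) / 2.0
projection = 2.0 * np.sqrt(np.clip(radius**2 - t**2, 0.0, None))
thetas = np.linspace(0.0, np.pi, 120, endpoint=False)
sinogram = np.tile(projection, (len(thetas), 1))
recon = filtered_backprojection(sinogram, thetas)

xx, yy = np.meshgrid(t, t)
r = np.hypot(xx, yy)
print("mean inside disc:", recon[r < 8].mean())
print("mean outside disc:", recon[(r > 30) & (r < 50)].mean())
```

The accumulation loop is the part that parallel implementations distribute across threads. Here each angle is added to the whole image in turn, so no two updates to the same pixel can overlap.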

Benchmark
Computed tomography images were created from data obtained through the website of the Technical University of Denmark¹. These data are part of The Visible Human² project of the National Library of Medicine of the United States and are composed of files with numerical data on the attenuation that the x-ray underwent during the examination.
In this study we used a computer with an AMD³ FX-8350 processor with 8 cores at 4.0 GHz, 8 GB of Kingston⁴ HyperX Fury DDR3 memory at 1600 MHz, a 240 GB Kingston UV400 SSD and an NVIDIA GTX 960 video card with a Maxwell-architecture GPU, 1024 CUDA cores and 4 GB of GDDR5 memory.
The study considered three main approaches: sequential CPU, CPU with threads, and GPU-accelerated computing using CUDA technology. The study thus analyzes which approach has the best performance without loss of image quality. The reconstructions used the single-precision float (32-bit) and double-precision double (64-bit) data types.
We reconstructed 13 different images. Each image was reconstructed 10 times, giving 130 reconstructions per run of an approach. Each of the three approaches was run 5 times, giving 1950 reconstructions per data type and a total of 3900 image reconstructions for this study.
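The reconstruction counts can be checked with a few lines of arithmetic. The variable names below are illustrative, and the sketch assumes that the total per data type covers all three approaches, which is the reading that reproduces the reported figures.

```python
images = 13               # different CT images
repeats_per_image = 10    # reconstructions of each image in one run
runs_per_approach = 5     # times each approach was executed
approaches = 3            # sequential CPU, CPU with threads, GPU
data_types = 2            # float and double

per_run = images * repeats_per_image
per_data_type = per_run * runs_per_approach * approaches
total = per_data_type * data_types

print(per_run, per_data_type, total)
```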
³ www.amd.com/pt/products/cpu/fx-8350
⁴ www.kingston.com/us/gaming/hyperx-fury-ddr3
The images were analyzed using the Peak Signal to Noise Ratio (PSNR) and Structural Similarity Index (SSIM) algorithms to ensure the quality of the generated images. According to [4], PSNR estimates how close the reconstructed image is to the original through the pixel differences; it is given by equation 3.
In the PSNR algorithm, the greater the result in decibels (dB), the more similar the images are. If the result is indeterminate or ∞, the images are the same, because the mean squared error in the denominator is zero.
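As an illustration of the PSNR behaviour described above, the sketch below uses the standard definition PSNR = 10 · log10(MAX² / MSE), where MSE is the mean squared pixel difference. The function name and the peak value MAX = 255 (8-bit images) are assumptions for this example; the study's own images may use a different range.

```python
import numpy as np

def psnr(reference, test, max_value=255.0):
    """Peak signal-to-noise ratio in dB; returns infinity for identical images."""
    reference = np.asarray(reference, dtype=np.float64)
    test = np.asarray(test, dtype=np.float64)
    mse = np.mean((reference - test) ** 2)
    if mse == 0.0:
        return float("inf")  # zero in the denominator: the images are the same
    return 10.0 * np.log10(max_value ** 2 / mse)

# Hypothetical 4 x 4 images: one identical copy and one with every pixel off by 1.
original = np.zeros((4, 4))
same = original.copy()
shifted = original + 1.0

psnr_same = psnr(original, same)
psnr_shifted = psnr(original, shifted)
print(psnr_same, psnr_shifted)
```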
According to [5], SSIM is an algorithm that assesses the similarity of images in terms of loss of correlation, luminance distortion and contrast distortion; it is given by equation 4.
For SSIM, a result of 1 means that the images are the same, and a result close to 1 means that the images are similar.
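The SSIM behaviour can be sketched in the same way. The example below evaluates the usual SSIM expression once over the whole image, which is a simplification: practical implementations average it over local windows. The function name, the data range of 255 and the constants k1 = 0.01 and k2 = 0.03 are the commonly used defaults and are assumptions here, not values taken from this study.

```python
import numpy as np

def global_ssim(x, y, data_range=255.0, k1=0.01, k2=0.03):
    """Single-window SSIM computed from global means, variances and covariance."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    c1 = (k1 * data_range) ** 2  # stabilizes the luminance term
    c2 = (k2 * data_range) ** 2  # stabilizes the contrast/structure term
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = np.mean((x - mu_x) * (y - mu_y))
    numerator = (2.0 * mu_x * mu_y + c1) * (2.0 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator

# Hypothetical 8 x 8 gradient image and a copy with one altered pixel.
image = np.arange(64, dtype=np.float64).reshape(8, 8)
altered = image.copy()
altered[0, 0] += 50.0

ssim_same = global_ssim(image, image)
ssim_altered = global_ssim(image, altered)
print(ssim_same, ssim_altered)
```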

Results and discussion
The images that make up the dataset used in the benchmark are illustrated in Figure 4. After executing the benchmark, the results obtained with the float data type can be observed in the graph of Figure 5.
The average GPU time for the reconstruction of a 2D image was 0.52 seconds, the CPU with threads took 1.37 seconds and the sequential CPU took 4.69 seconds. The GPU therefore stands out in the reconstruction of 2D computed tomography images with the float data type.
Fig. 5 Graph of the performance of 2D image reconstructions using the 32-bit float data type.
Regarding the times measured for the float data type on the presented hardware configuration, the GPU obtained an average time of approximately 1/3 of the parallel CPU time, due to the number of CUDA cores on the GPU compared with the number of CPU cores. Something similar occurs when comparing the CPU with threads against the sequential CPU, where the number of cores used in processing is a determining factor: the CPU with threads took less than 1/3 of the sequential CPU time. Regarding image quality, Tables 1 and 2 show that the images are similar in each approach.
Table 1 shows the similarity of the images generated by the approaches, using the float data type, according to the PSNR image quality analysis algorithm. Table 2 shows the similarity of the images generated by the approaches, using the float data type, according to the SSIM image quality analysis algorithm.
The results obtained with the double data type can be observed in the graph of Figure 6. The average time of the CPU with threads was 1.66 seconds, the GPU took 1.93 seconds, and the sequential CPU averaged 4.57 seconds, performing slightly better than in its float version. Regarding GPU performance with the double data type, according to [6], bandwidth, the increase in the number of bytes to read from 4 to 8, and the architecture of the CUDA processors all influence GPU performance, which resulted in higher times than those of the CPU with threads.
In this experiment, the CPU with threads obtained a better average time than the GPU. This result is explained by the fact that the Maxwell GPU architecture prioritizes the float data type, achieving higher throughput and more gigaflops than with the double data type [22].
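The average times reported for both data types can be turned into ratios to make the comparison explicit. The dictionary layout and the helper name below are illustrative. The numbers are the averages in seconds quoted in the text.

```python
# Average reconstruction times in seconds, as reported in the text.
times = {
    "float":  {"gpu": 0.52, "cpu_threads": 1.37, "cpu_sequential": 4.69},
    "double": {"gpu": 1.93, "cpu_threads": 1.66, "cpu_sequential": 4.57},
}

def ratio(data_type, approach, baseline):
    """Time of `approach` expressed as a fraction of the time of `baseline`."""
    return times[data_type][approach] / times[data_type][baseline]

for data_type, measured in times.items():
    fastest = min(measured, key=measured.get)
    print(data_type,
          "| fastest:", fastest,
          "| GPU / threads: %.2f" % ratio(data_type, "gpu", "cpu_threads"),
          "| threads / sequential: %.2f" % ratio(data_type, "cpu_threads", "cpu_sequential"))
```

For float, the GPU-to-threads ratio comes out at about 0.38, in line with the "approximately 1/3" stated in the discussion. For double, the same ratio is above 1, which reflects the CPU with threads being faster than the GPU.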
Regarding the quality of the generated images, Tables 3 and 4 show that the images are similar.
Table 3 shows the similarity of the images generated by the approaches, using the double data type, according to the PSNR image quality analysis algorithm. Table 4 shows the similarity of the images generated by the approaches, using the double data type, according to the SSIM image quality analysis algorithm. It is noteworthy that each new reconstruction can yield different PSNR and SSIM values. As discussed in [17], the authors attribute this to the threads' read-modify-write process, which requires atomic operations to eliminate the noise, at the cost of performance. This does not happen in the sequential CPU approach, where the PSNR and SSIM results show that the images are the same.
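The read-modify-write problem can be illustrated with a deterministic simulation. The sketch below does not use real threads. It replays one possible interleaving by hand, in which two workers read the same pixel value before either writes its result back, so one contribution is lost. The function names and the contribution values are hypothetical, and real GPU behaviour depends on thread scheduling, which is why the noise differs from run to run.

```python
def accumulate_with_race(contributions):
    """Simulate pairs of workers that both read the pixel before either one writes."""
    pixel = 0.0
    for a, b in zip(contributions[0::2], contributions[1::2]):
        read_a = pixel       # worker A reads the current value
        read_b = pixel       # worker B reads the same, soon stale, value
        pixel = read_a + a   # worker A writes its update
        pixel = read_b + b   # worker B overwrites it: A's contribution is lost
    return pixel

def accumulate_atomically(contributions):
    """Each read-modify-write completes before the next one starts."""
    pixel = 0.0
    for c in contributions:
        pixel += c
    return pixel

# Hypothetical backprojection contributions to a single pixel.
contributions = [1.0, 2.0, 3.0, 4.0]

racy = accumulate_with_race(contributions)
atomic = accumulate_atomically(contributions)
print("with race:", racy, "| atomic:", atomic)
```

Atomic operations enforce the second behaviour on the GPU by serializing updates to the same memory location, which is the source of the performance loss mentioned above.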

Conclusion
According to the results obtained, the images generated by each approach are more than 99% similar. However, when interpolating a batch of 2D tomography images, unreliable results can be generated in the 3D volumetric images, owing to the accumulation of noise generated both in the reconstruction and in the interpolation of the images.
Regarding the double data type, the architecture of the hardware must be analyzed beforehand in order to determine which methodology performs better.
It is also concluded that the GPU performs better with the float data type when compared with a conventional CPU, and that the CPU using threads performed better with the double data type. Performance thus depends on both the data type and the CPU/GPU architecture used in the experiment.
As a continuation of this research, we intend to develop the reconstruction of 3D computed tomography images using the algorithm of Feldkamp, Davis and Kress (FDK) according to [20], which uses the filtered backprojection algorithm to reconstruct three-dimensional images from projections of a cone beam of x-rays. To accelerate the reconstruction of 3D images, we intend to use massively parallel programming with CUDA and, after the reconstructions, to analyze the noise added to the images in comparison with sequential programming on the CPU. It is worth mentioning that a 3D image is obtained from multiple 2D images.