Convolutional Encoder-Decoder Networks for Volumetric Computed Tomography Surviews from Single- and Dual-View Topograms

Computed tomography (CT) is an extensively used imaging modality capable of generating detailed images of a patient’s internal anatomy for diagnostic and interventional procedures. High-resolution volumes are created by measuring and combining information along many radiographic projection angles. In current medical practice, single and dual-view two-dimensional (2D) topograms are utilized for planning subsequent diagnostic scans and for selecting favorable acquisition parameters, either manually or automatically, as well as for dose modulation calculations. In this study, we develop modified 2D to three-dimensional (3D) encoder-decoder neural network architectures to generate CT-like volumes from single and dual-view topograms. We validate the developed neural networks on synthesized topograms from publicly available thoracic CT datasets. Finally, we assess the viability of the proposed transformational encoder-decoder architecture on both common image similarity metrics and quantitative clinical use case metrics, a first for 2D-to-3D CT reconstruction research. According to our findings, both single-input and dual-input neural networks are able to provide accurate volumetric anatomical estimates. The proposed technology will allow for improved (i) planning of diagnostic CT acquisitions, (ii) input for various dose modulation techniques, and (iii) recommendations for acquisition parameters and/or automatic parameter selection. It may also provide for an accurate attenuation correction map for positron emission tomography (PET) with only a small fraction of the radiation dose utilized.


I. Introduction
X-ray computed tomography (CT) is an extensively used imaging modality that captures three-dimensional (3D) anatomical structures, primarily for diagnostic applications 1 . Clinical CT systems rely on measuring individual x-ray projections in axial or helical acquisition modes while rotating around the patient body, eliminating spatial disadvantages of prior two-dimensional (2D) radiographic x-ray modalities 2,3 . Through filtered back-projection (FBP) or state-of-the-art reconstruction techniques, such as iterative reconstruction (IR) or artificial intelligence (AI), modern CT scanners can produce high-resolution volumetric images of a patient's anatomy 4 . However, by measuring hundreds of projections at varying angles, CT scans involve ionizing radiation exposure to patients, raising numerous health concerns such as damage to DNA and the associated increased risk of cancer [5][6][7][8] . Dose modulation techniques [9][10][11] and manual or automatic kVp selection 12,13 are examples of modern CT imaging methods that provide means to significantly reduce the amount of radiation required to generate datasets of diagnostic image quality. Dose modulation allows a reduction of radiation dose levels by adjusting the x-ray flux during a scan through current modulations of the accelerated electrons that bombard the anode. Such current modulations are based on patient-specific information that is derived from either coronal or sagittal 2D topograms, or both. Such 2D topograms, referred to by different CT vendors as scouts, surviews, or tomograms, provide input for both planning of the diagnostic scan and for estimating water-equivalent diameters (WED) that represent the patient size in every cross-section and serve as a basis for actively reducing/increasing the radiation dose in large/small cross sections 14 .
Planning topograms are also being used to select the accelerating electron voltage (kVp), which can be determined either manually or automatically, in order to control the quality of the beam and ensure that a large proportion of x-ray photons is utilized for imaging, e.g., higher kVp values for larger patient sizes 12,13,15 . Information in 3D-surview volumes that were generated from single- or dual-view topograms has the potential to improve current and future dose modulation techniques and acquisition parameter selections for reduced patient risk. For example, dynamic bowtie filter concepts for angle-dependent x-ray flux modulation [15][16][17] would also greatly benefit from the additional information contained within 3D-surviews. Finally, 3D-surviews could enable improved planning of subsequent diagnostic scans, e.g., through better organ visualization.
In recent years, the appeal of deep learning and AI in the medical imaging domain has inspired several studies investigating deep-learning facilitated CT reconstruction 18-20 . Notable studies involve transforming and denoising low-dose CT scans with convolutional autoencoders and convolutional neural network (CNN) facilitated limited-angle reconstruction [21][22][23] . Within the specific subfield of 2D-to-3D CT mapping, Shen et al. investigate deep learning methods for generating volumetric datasets from single-view radiographic projections 22 . However, few studies investigate the advantages of utilizing stereo 2D-to-3D CT. Kasten et al. developed a CNN to generate 3D knee bone segmentations from pairs of 2D knee radiographs 23 . However, the challenge of generating volumetric datasets for additional body sections, e.g., for thoracic CT scans, from more than one clinically obtainable projection remains.
In this study, we develop and implement single-view and dual-view CNNs for volumetric CT mapping from 2D topogram projections. Both neural networks follow a modified encoder-decoder architecture, as shown in Fig. 1, and are comprised of three individual convolutional subnetworks: a feature representation subnetwork, a latent transformation subnetwork, and a generation subnetwork. To determine the theoretical viability of these encoder-decoder architectures, we train and assess their performance on publicly available CT datasets from the Lung Image Database Consortium dataset 24 . Finally, we assess the clinical viability of the two neural networks with use case-specific metrics for common CT applications, including dose modulation techniques and improved clinical scan planning that is strengthened by superior anatomy-awareness.

A. Data Collection and Generation of Paired Topograms
The publicly available Lung Image Database Consortium (LIDC) dataset includes images from 1050 helical thoracic CT examinations that were compiled from seven academic centers and eight different CT vendors 24 . Each exam contains both volumetric pixel information as well as relevant scan parameters (e.g., tube current and tube voltage). However, the database includes topograms for only a very limited number of cases, and typically only for a single view, i.e., sagittal or coronal. To accurately train deep neural networks that generate volumetric images from input 2D topograms, a large training dataset of paired 2D-3D data is required. For this, we implemented a dedicated synthetic topogram creation process 25 . The process, which is depicted in Fig. 2, involves simulating multiple beams that traverse through clinical CT volumes from a fixed x-ray point source. For each ray starting at the predefined point source location and entrance voxel, the next point of intersection is defined as the closest voxel border to the path of the ray. The process is repeated with the point of intersection representing the new entrance point, until the ray exits the volume. For each ray, the products of the distance between entrance and exit point and the voxel attenuation are aggregated. Topogram intensities are then synthesized on a simulated pixel x-ray detector by utilizing the Beer-Lambert law of x-ray attenuation 5 : I = I_0 · exp(−Σ_i μ_i d_i), where I_0 is the simulated source intensity, μ_i is the linear attenuation coefficient of the i-th voxel traversed by the ray, and d_i is the corresponding intersection length within that voxel.
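The ray-casting procedure described above can be approximated in a few lines. The sketch below is illustrative rather than the paper's implementation: it assumes an idealized parallel-beam geometry (each detector pixel's line integral is a straight sum along one volume axis) instead of a fixed point source with voxel-border traversal, and the function name and parameters are hypothetical.

```python
import numpy as np

def synthesize_topogram(mu_volume, voxel_size_mm, axis=1, i0=1.0):
    """Approximate a 2D topogram from a volume of linear attenuation
    coefficients (1/mm) under a simplified parallel-beam assumption.

    axis=1 collapses the anterior-posterior direction (coronal view);
    axis=2 collapses the lateral direction (sagittal view), assuming a
    (z, y, x) volume layout.
    """
    # Line integral of attenuation along each simulated ray
    line_integrals = mu_volume.sum(axis=axis) * voxel_size_mm
    # Beer-Lambert law: transmitted intensity I = I0 * exp(-sum(mu_i * d_i))
    return i0 * np.exp(-line_integrals)
```

For a homogeneous water-like volume (μ ≈ 0.02/mm at typical tube voltages), each detector reading reduces to I_0 · exp(−μL) for ray path length L, which makes the sketch easy to sanity-check.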

B. Encoder-Decoder Neural Networks
Our formulated solution to the 2D-to-3D mapping problem is defined through a modified encoder-decoder neural network architecture with an embedded transformation subnetwork that is based on the autoencoder architecture 26 (Fig. 1). Consider an autoencoder with input x and output x̂ such that x̂ ≈ x. Through stochastic backpropagation and a predefined loss function L(x, x̂), the autoencoder iteratively updates the parameters of the neural network so that, given a representative input from the input dataset, the neural network may produce a near-identical output. We developed two encoder-decoder architectures which follow a similar methodology, one for a single-view input and one for a dual-view (stereo) input. We consider the problem of volumetric CT mapping from a single or double 2D projection as a traditional image translation task with an additional transformation layer to increase image dimensionality. Given coronal and/or sagittal topogram projections T_cor and T_sag, the goal of the neural network is to generate a predicted output volume V̂ such that V̂ ≈ V, where V is the ground-truth clinical CT volume. We therefore define two deep learning mapping functions F_single and F_dual, such that V̂ = F_single(T_cor) and V̂ = F_dual(T_cor, T_sag). Via stochastic gradient descent and convolutional backpropagation, we iteratively update the model weights of F_single and F_dual according to the ground-truth CT volume V through a loss function L(V, V̂).

C. Representation Subnetwork
The representation subnetwork (shown with yellow background in Figs. 1 and 3) is tasked with reducing the dimensionality of the input 2D topogram or topograms. Given a single-view input topogram T_cor, or two dual-view topograms T_cor and T_sag, a representation subnetwork R is trained to generate a single output latent tensor for each of the input topograms. In both the single-view and dual-view neural networks, the subnetwork consists of a sequence of convolutional blocks, each comprising a convolution with batch normalization and the Rectified Linear Unit (ReLU) activation function 27 . We chose the ReLU activation function over alternatives due to its faster convergence and its tendency to resolve issues such as vanishing or exploding gradients, as found during initial tests.
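A minimal PyTorch sketch of such a representation subnetwork follows. The class name, channel widths (32, doubling per block), kernel size 4, stride 2, and the four-block depth are illustrative assumptions, not the paper's configuration; only the convolution + batch normalization + ReLU block pattern comes from the text.

```python
import torch
import torch.nn as nn

class RepresentationSubnetwork(nn.Module):
    """Encodes a 2D topogram into a low-resolution latent tensor.

    Hypothetical sketch: hyperparameters below are placeholders.
    """
    def __init__(self, in_channels=1, base=32, n_blocks=4):
        super().__init__()
        layers = []
        c = in_channels
        for i in range(n_blocks):
            out = base * (2 ** i)
            # Each block: strided conv -> batch norm -> ReLU, halving H and W
            layers += [
                nn.Conv2d(c, out, kernel_size=4, stride=2, padding=1),
                nn.BatchNorm2d(out),
                nn.ReLU(inplace=True),
            ]
            c = out
        self.encoder = nn.Sequential(*layers)

    def forward(self, x):
        return self.encoder(x)
```

Under these placeholder settings, a 128 × 128 topogram is reduced to a 256 × 8 × 8 latent tensor.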

D. Transformation Subnetwork
The transformation subnetwork (shown with red background in Figs. 1 and 3) is tasked with combining the latent tensor representations of each radiographic projection and increasing their dimensionality. Considering the transformation subnetwork H, latent tensor z, and reshaped latent tensor z′, the single-view subnetwork is invoked such that z′ = H(z). In the dual-view architecture, both previously generated latent tensors z_cor and z_sag are concatenated to form a higher-dimensionality latent tensor. Hence, we denote the operation as z′ = H([z_cor, z_sag]). In both variants of the transformation subnetwork, a single convolutional layer is invoked to learn the new reshaped spatial hierarchies.
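One way to realize this step in PyTorch is sketched below, with two explicit assumptions beyond the text: the reshape reinterprets a group of latent channels as a new depth axis, and the single convolution uses a 1 × 1 × 1 kernel. Class name and tensor sizes are hypothetical.

```python
import torch
import torch.nn as nn

class TransformationSubnetwork(nn.Module):
    """Fuses one or two 2D latent tensors and lifts them to 3D.

    Hypothetical sketch: the channel-to-depth reshape and the 1x1x1
    kernel are our assumptions, not the paper's stated configuration.
    """
    def __init__(self, in_channels=512, depth=8, out_channels=64):
        super().__init__()
        assert in_channels % depth == 0
        self.depth = depth
        # Single 3D convolution learning the reshaped spatial hierarchies
        self.conv3d = nn.Conv3d(in_channels // depth, out_channels, kernel_size=1)

    def forward(self, *latents):
        # Dual-view: concatenate latents along the channel axis
        z = torch.cat(latents, dim=1) if len(latents) > 1 else latents[0]
        n, c, h, w = z.shape
        # Reinterpret a group of channels as a new depth dimension
        z3d = z.view(n, c // self.depth, self.depth, h, w)
        return self.conv3d(z3d)
```

With two 256 × 8 × 8 latents, concatenation yields 512 channels, which the reshape turns into a 64 × 8 × 8 × 8 volume-shaped tensor before the convolution.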

E. Generation Subnetwork
The generation subnetwork is the final component of the developed set of encoder-decoder neural networks and is tasked with enlarging the reshaped latent tensor into a final volumetric output. Considering the reshaped latent tensor z′ and the final output CT volume V̂, the generation subnetwork, abstracted as G, is invoked such that V̂ = G(z′). The hidden layers of the generation subnetwork form a sequence of deconvolutional blocks, each with the ReLU activation function.
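A hedged PyTorch sketch of the generation subnetwork: transposed 3D convolutions ("deconvolutions") with ReLU, each doubling the spatial size. The block count, channel schedule, and final single-channel projection are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GenerationSubnetwork(nn.Module):
    """Upsamples a reshaped latent tensor into a volumetric output.

    Hypothetical sketch: layer counts and channels are placeholders;
    only the deconvolution + ReLU block pattern follows the text.
    """
    def __init__(self, in_channels=64, n_blocks=3):
        super().__init__()
        layers = []
        c = in_channels
        for _ in range(n_blocks):
            out = max(c // 2, 1)
            # Each block doubles depth/height/width via transposed conv
            layers += [
                nn.ConvTranspose3d(c, out, kernel_size=4, stride=2, padding=1),
                nn.ReLU(inplace=True),
            ]
            c = out
        # Project to a single-channel CT-like volume
        layers.append(nn.Conv3d(c, 1, kernel_size=3, padding=1))
        self.decoder = nn.Sequential(*layers)

    def forward(self, z):
        return self.decoder(z)
```

Starting from a 64 × 8 × 8 × 8 latent, three such blocks yield a 64 × 64 × 64 single-channel output volume under these placeholder settings.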

F. Network Training
To determine the accuracy of both single-view and dual-view architectures, both neural networks are trained on the paired synthesized topogram and volumetric CT dataset. For both models, an initial learning rate of 0.0002 on the Adam optimizer is used to minimize the mean-squared-error (MSE) loss function via stochastic gradient descent and backpropagation. All training was conducted on two SLI-connected NVIDIA Tesla P100 GPUs, each with 16 GB of VRAM. Model weights are saved locally every 10 epochs, and the training process automatically terminates after convergence. The weights with the lowest average loss are preserved and serialized.
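The stated optimization setup (Adam, initial learning rate 0.0002, MSE loss) can be sketched as a generic PyTorch training loop. The toy model and random tensors below are placeholders for the actual encoder-decoder networks and the paired topogram/volume training data.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Placeholder model and data standing in for the encoder-decoder networks
# and the paired topogram/CT training set.
model = nn.Sequential(nn.Flatten(), nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 64))
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)  # initial LR from the paper
loss_fn = nn.MSELoss()

inputs = torch.randn(16, 1, 8, 8)   # stand-in 2D topograms
targets = torch.randn(16, 64)       # stand-in (flattened) CT volumes

losses = []
for epoch in range(20):
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    optimizer.step()  # Adam update of all model weights
    losses.append(loss.item())
```

In practice, checkpointing every 10 epochs and keeping the weights with the lowest average loss, as described above, would wrap this loop.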

G. Evaluation Metrics
For quantitative evaluation, four common image similarity metrics are calculated: mean-squared error (MSE), mean absolute error (MAE), structural similarity (SSIM), and peak signal-to-noise ratio (PSNR). Additionally, the accuracy of lung/tissue segmentations that are based on thresholding the volumes produced by both neural networks and the original (ground-truth) CT volume is used as an application-focused metric with DICE score calculations 28,29 . Finally, we adopt a methodology to determine the quality of the recovered volumes as input for dose modulation techniques. For this, we first remove extraneous objects from the CT volume through thresholding and connected component labeling 30 . Next, we calculate the water-equivalent diameter (D_w) of each slice of the ground-truth and estimated volumes by: A_w = Σ_(x,y) (CT(x,y)/1000 + 1) · A_pixel and D_w = 2·√(A_w/π), where CT(x,y)/1000 + 1 represents the water-normalized attenuation of a voxel situated at coordinates (x, y), A_pixel represents the area of a pixel, and A_w represents the water-equivalent area of the given CT slice. This method with its underlying approximations was demonstrated to be valid in Wang et al. with both analytical and Monte Carlo methods 31 . Similar A_w and D_w calculations are also included as part of the AAPM Task Group 220 report 14 .
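The water-equivalent diameter computation follows the AAPM TG-220 definition cited above and can be sketched directly; the helper name and argument conventions are our own, and the slice is assumed to be already cleaned of extraneous objects, with air outside the patient at -1000 HU.

```python
import numpy as np

def water_equivalent_diameter(ct_slice, pixel_area_mm2):
    """Water-equivalent area and diameter of one axial CT slice
    (AAPM TG-220 style computation).

    ct_slice       : 2D array of HU values (extraneous objects removed,
                     air outside the body at -1000 HU)
    pixel_area_mm2 : physical area of one pixel in mm^2
    """
    # Water-normalized attenuation: water (0 HU) -> 1, air (-1000 HU) -> 0
    mu_norm = ct_slice / 1000.0 + 1.0
    # Water-equivalent area A_w: sum of normalized attenuations x pixel area
    a_w = np.sum(mu_norm) * pixel_area_mm2
    # Diameter of the water cylinder with the same attenuating area
    d_w = 2.0 * np.sqrt(a_w / np.pi)
    return a_w, d_w
```

As a sanity check, a simulated 100 mm water cylinder (0 HU inside, -1000 HU outside) returns a D_w of approximately 100 mm.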
In addition to these quantitative image similarity metrics, qualitative analysis was performed on typical generated samples.

III. Results
The results reported in this section were averaged over all 60 test dataset volumes, i.e., CT cases that were never seen by either of the neural networks, to determine the efficacy of the proposed deep learning framework for dose modulation and patient size estimation tasks.
Figures 4 and 5 display example results of the two neural networks as compared to their corresponding ground-truth CT datasets. Error maps of the intensity difference between original CT volumes, representing ground-truth intensities, and 3D volumes based on 2D inputs demonstrate considerably lower errors for volumes generated from dual-view inputs compared to those generated from single-view inputs. These observations are consistent with the lower MAE and MSE values of the dual-view neural network, as given in Table 1. While soft tissue anatomies are recovered accurately, we note a generally poor recovery of contrast and details of bony tissue that is most evident in the spine. That said, the dual-view neural network does demonstrate more promise in recovering both soft and bony tissues. The lungs in a CT volume can be (naïvely) segmented by thresholding and applying connected component analysis 30 (DICE scores are given in Table 1). The high-fidelity lung/tissue segmentations can also be perceived from the true-positive (TP) / false-positive (FP) / false-negative (FN) error maps for both models, where the volumes generated from the dual-view neural network present fewer false-negative and false-positive voxels than the volumes generated from the single-view neural network.
Examples of patient size estimations, as represented by water-equivalent area (A_w) distributions along the patient axis, are displayed in Fig. 6. Note the similarities in mesh geometries between the ground-truth representation (a), the single-view neural network representation (b), and the dual-view neural network representation (c). Calculated average water-equivalent diameters (D_w) resulted in accurate patient size estimations for both neural networks, with the single-view architecture scoring 0.911 and the dual-view architecture scoring 0.925. Based on these results, the two neural networks, and especially the dual-view variant, demonstrate a potential to improve dose modulation techniques by providing accurate volumetric information as input for calculating efficient dose-modulated CT acquisitions.

IV. Discussion
In this investigation, we develop and refine two encoder-decoder neural network architectures for recovering volumetric CT data from single- or dual-view radiographic projections. The developed models were trained and tested on paired data of publicly available CT datasets and synthetically generated topograms (Figure 2). While the generated 3D-surviews do not contain sufficient information for diagnostic tasks and are thus not considered as an alternative to volumetric CT datasets, both the single- and the dual-view networks perform well on image similarity and use case-specific metrics. Moreover, our results indicate that, in most cases, the additional information that is available with the dual-view neural network enables improved accuracy as compared to the single-view neural network. Specifically, from qualitative observations and quantitative image similarity metrics across multiple samples, we find that the dual-view neural network is generally superior at recovering both the anatomy and soft-tissue/bone intensities compared to the single-view neural network and is thus more relevant for further clinical translation developments. However, both models perform comparably to the current state of the art in deep learning CT recovery from limited information [21][22][23] , indicating the viability of both architectures in practice.
Based on our qualitative evaluation, the DICE scores from Table 1, and the water-equivalent area and diameter comparisons (Figure 6), we deduce that there are only minor differences between patient size estimations that rely on the complete information available with the (retrospective) diagnostic scan and patient size estimations that rely on information that was extracted by the single- or dual-view neural networks. Our results indicate that the developed deep neural networks can provide a promising method with potential applications for automatic acquisition parameter selection 32 , e.g., automatic kVp 12,13 , more efficient dose modulation techniques that utilize improved high-dimensional input, e.g., for various dynamic bowtie filter concepts [15][16][17] , and improved scan planning with better anatomy-awareness. Tube current modulation during diagnostic CT scans is a crucial technique that allows a significant reduction of ionizing radiation dose, and thus improved patient safety, while maintaining diagnostic image quality. Conventionally, metrics such as patient size, measured as water-equivalent diameter, or local (along the patient axis) cylindrical or elliptical cross-section representations, also measured in water-equivalent diameter, are used to calculate position/angle-dependent radiation dosages 14 . In addition, calculated patient sizes provide the basis for size-specific dose estimates (SSDE) 14 . Finally, the additional information may be of sufficient resolution to enable sufficiently accurate tissue attenuation inputs for the generation of PET attenuation correction maps 33 with a significant reduction of ionizing radiation dose levels, i.e., the use of single- or dual-view topograms instead of a complete volumetric scan.
In conclusion, we demonstrate an ability to utilize low-dose single- or dual-view CT topograms, which in current clinical practice are acquired in almost all CT examinations, for estimating volumetric patient anatomies with high accuracy. Due to the abundance of matched CT-topogram data, such neural networks can be trained and integrated into existing CT scanners relatively seamlessly, i.e., without requiring any hardware changes, to enable acquisition parameter recommendations or automatic selections, a reduction in ionizing radiation dose levels through improved dose modulation calculations, and, potentially, PET attenuation correction maps with only a fraction of the currently utilized radiation dose.
Declarations

Figure 2. Illustration of the synthetic topogram creation process we applied to generate coronal (left) and sagittal (right) topograms from 1050 publicly available clinical CT volumes. The process involves simulating multiple x-ray beams that traverse through CT volumes from a fixed x-ray point source by utilizing the Beer-Lambert law of attenuation, where final intensities are measured on a simulated detector with 1024 × 1024 pixels behind the patient body.

Figure 6. An example of volumetric external bodies and volumetric water-equivalent areas (A_w) for ground-truth datasets and neural network generated 3D-surview volumes. Displayed are (a) meshes created from the water-equivalent areas of a ground-truth CT volume, (b) a CT-like 3D-surview generated by the single-view neural network, and (c) a CT-like 3D-surview generated by the dual-view neural network. Our results demonstrate excellent recovery of body size representations for a variety of applications, e.g., radiation dose modulation.