A. Data Collection and Generation of Paired Topograms
The publicly available Lung Image Database Consortium (LIDC) dataset includes images from 1050 helical thoracic CT examinations compiled from seven academic centers and eight different CT vendors24. Each exam contains both volumetric pixel data and relevant scan parameters (e.g., tube current and tube voltage). However, the database includes topograms for only a very limited number of cases, and typically only for a single view, i.e., sagittal or coronal. Accurately training deep neural networks that generate volumetric images from input 2D topograms requires a large dataset of paired 2D-3D data. For this, we implemented a dedicated synthetic topogram creation process25. The process, depicted in Fig. 2, involves simulating multiple rays that traverse clinical CT volumes from a fixed x-ray point source. For each ray starting at the predefined point-source location and entrance voxel, the next point of intersection is defined as the voxel border closest to the path of the ray. The process is repeated, with each point of intersection becoming the new entrance point, until the ray exits the volume. Along each ray, the products of the distance between entrance and exit points and the corresponding voxel attenuation are accumulated. Topogram intensities are then synthesized on a simulated \(1024\times 1024\) pixel x-ray detector using the Beer-Lambert law of x-ray attenuation5:
$$I\left({x}_{n}\right)=I\left({x}_{0}\right){e}^{-\sum _{i=0}^{n}A\left(i\right){\Delta x}_{i}} \qquad \left(1\right)$$
where \(I({x}_{0})\) is the unattenuated source intensity, \(A(i)\) is the attenuation of the \(i\)-th intersected voxel, and \({\Delta x}_{i}\) is the ray path length through that voxel.
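To make the projection process concrete, the following NumPy sketch casts one ray per detector pixel from a point source and applies Eq. (1). It uses fixed-step ray marching with nearest-neighbor voxel lookup as a simplification of the exact voxel-boundary traversal described above; the function name, detector geometry, pixel pitch, and step size are illustrative assumptions rather than values from our implementation.

```python
import numpy as np

def synthesize_topogram(mu, voxel_mm, source, det_origin, det_u, det_v,
                        det_shape=(64, 64), step_mm=1.0, i0=1.0):
    """Simulate a topogram: cast one ray per detector pixel from a point
    source through the attenuation volume `mu` (1/mm) and apply Eq. (1).

    `det_origin` is the corner of the detector plane; `det_u` and `det_v`
    are its in-plane basis vectors, whose magnitudes set the pixel pitch.
    Fixed-step marching approximates the exact voxel-boundary traversal.
    """
    topogram = np.zeros(det_shape, dtype=np.float32)
    for r in range(det_shape[0]):
        for c in range(det_shape[1]):
            pixel = det_origin + r * det_u + c * det_v
            ray = pixel - source
            length = np.linalg.norm(ray)
            direction = ray / length
            # Accumulate sum_i A(i) * dx_i along the ray (exponent of Eq. 1).
            line_integral = 0.0
            for t in np.arange(0.0, length, step_mm):
                idx = np.floor((source + t * direction) / voxel_mm).astype(int)
                if np.all(idx >= 0) and np.all(idx < np.array(mu.shape)):
                    line_integral += mu[tuple(idx)] * step_mm
            topogram[r, c] = i0 * np.exp(-line_integral)  # Beer-Lambert law
    return topogram

# Example: project a uniform water-like phantom (attenuation in 1/mm).
phantom = np.full((128, 128, 128), 0.02, dtype=np.float32)
topo = synthesize_topogram(phantom, voxel_mm=1.0,
                           source=np.array([-500.0, 64.0, 64.0]),
                           det_origin=np.array([300.0, 0.0, 0.0]),
                           det_u=np.array([0.0, 2.0, 0.0]),
                           det_v=np.array([0.0, 0.0, 2.0]))
```

In practice the inner loops would be vectorized or replaced by an exact Siddon-style traversal, and the detector would be the full \(1024\times 1024\) grid described above.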
B. Encoder-Decoder Neural Networks
Our formulated solution to the 2D-to-3D mapping problem is defined through a modified encoder-decoder neural network architecture with an embedded transformation subnetwork, based on the autoencoder architecture26 (Fig. 2). Consider an autoencoder \(A\) with input \(X\) and output \(Y\) such that \(A\left(X\right)=Y\). Through backpropagation with a predefined loss function \(L(X, Y)\), the autoencoder iteratively updates the parameters of the neural network so that, given a representative input from the input dataset, the network produces a near-identical output. We developed two encoder-decoder architectures that follow a similar methodology, one for a single-view input and one for a dual-view (stereo) input. We treat the problem of mapping single or dual 2D projections to a volumetric CT as a traditional image translation task with an additional transformation layer that increases image dimensionality. Given coronal and/or sagittal topogram projections \({X}_{1}\) and \({X}_{2}\), the goal of the neural network is to generate a predicted output volume \({Y}_{pred}\) such that \({Y}_{pred}\approx {Y}_{truth}\), where \({Y}_{truth}\) is the ground-truth clinical CT volume. We therefore define two deep learning mapping functions \({F}_{1}\) and \({F}_{2}\), such that \({F}_{1}\left({X}_{1}\right)={Y}_{pred}\) and \({F}_{2}({X}_{1}, {X}_{2})={Y}_{pred}\). Via stochastic gradient descent and backpropagation, we iteratively update the model weights of \({F}_{1}\) and \({F}_{2}\) against the ground-truth CT volume \({Y}_{truth}\) through a loss function \(L({Y}_{pred}, {Y}_{truth})\).
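As an illustration of this composition, the PyTorch skeleton below (class and argument names are our own, hypothetical choices) chains the three subnetworks described in Sections C-E; their internals are sketched in those sections.

```python
import torch.nn as nn

class TopogramToCT(nn.Module):
    """Skeleton of the mappings F1 (single-view) and F2 (dual-view):
    Y_pred = generate(transform(represent(X1)[, represent(X2)]))."""
    def __init__(self, represent: nn.Module, transform: nn.Module,
                 generate: nn.Module):
        super().__init__()
        self.represent = represent  # Section C: topogram -> latent tensor L
        self.transform = transform  # Section D: latent(s) -> reshaped 3D latent Z
        self.generate = generate    # Section E: Z -> predicted CT volume

    def forward(self, x1, x2=None):
        l1 = self.represent(x1)
        if x2 is None:                                   # single-view F1
            return self.generate(self.transform(l1))
        return self.generate(self.transform(l1, self.represent(x2)))  # dual-view F2
```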
C. Representation Subnetwork
The representation subnetwork (shown with yellow background in Figs. 1 and 3) is tasked with reducing the dimensionality of the input 2D topogram or topograms. Given a single-view input topogram \({X}_{1}\), or two dual-view topograms \({X}_{1}\) and \({X}_{2}\), a representation subnetwork is trained to generate a single output latent tensor \(L\) for each of the input topograms. The data flow of the subnetwork in both the single-view and dual-view neural networks is given as: \(1024\times 1024\times 1 \to 1024\times 1024\times 32 \to 512\times 512\times 64 \to 256\times 256\times 128 \to 128\times 128\times 256 \to 64\times 64\times 512 \to 32\times 32\times 1024 \to 8\times 8\times 4096 \to 4\times 4\times 4096\), with each '\(\to\)' representing a convolutional block with batch normalization and the Rectified Linear Unit (ReLU) activation function27. We chose the ReLU activation function over alternatives because, in our initial tests, it converged faster and mitigated vanishing and exploding gradients.
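A PyTorch sketch of this encoder follows, assuming \(3\times 3\) kernels and the stride schedule shown in the comments; the text specifies only the tensor shapes, so these hyperparameters are illustrative.

```python
import torch.nn as nn

def conv_block(c_in, c_out, stride):
    """One '->' in the data flow: convolution + batch norm + ReLU.
    Kernel size 3 and the strides are our assumptions."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, stride=stride, padding=1),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )

class RepresentationNet(nn.Module):
    """Encodes a 1024x1024 single-channel topogram into a 4x4x4096 latent
    tensor, following the shape sequence listed in the text."""
    def __init__(self):
        super().__init__()
        self.encode = nn.Sequential(
            conv_block(1, 32, stride=1),       # 1024x1024x32
            conv_block(32, 64, stride=2),      # 512x512x64
            conv_block(64, 128, stride=2),     # 256x256x128
            conv_block(128, 256, stride=2),    # 128x128x256
            conv_block(256, 512, stride=2),    # 64x64x512
            conv_block(512, 1024, stride=2),   # 32x32x1024
            conv_block(1024, 4096, stride=4),  # 8x8x4096 (stride 4 bridges 32 -> 8)
            conv_block(4096, 4096, stride=2),  # 4x4x4096 latent tensor L
        )
    def forward(self, x):                      # x: (N, 1, 1024, 1024)
        return self.encode(x)
```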
D. Transformation Subnetwork
The transformation subnetwork (shown with red background in Figs. 1 and 3) is tasked with combining the latent tensor representations of each radiographic projection and increasing their dimensionality. Considering transformation subnetwork \(T\), latent tensor \(L\), and reshaped latent tensor \(Z\), the single-view subnetwork is invoked such that \(T\left(L\right)=Z\). In the dual-view architecture, the two previously generated latent tensors \({L}_{1}\) and \({L}_{2}\) are concatenated before being reshaped into the higher-dimensionality latent tensor \(Z\); we denote this operation as \(T\left({L}_{1}, {L}_{2}\right)=Z\). In both variants of the transformation subnetwork, a single convolution with kernel size \(1\times 1\times 1\) is invoked to learn the new reshaped spatial hierarchies.
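A sketch of this step is given below. The exact reshape and concatenation order is not specified in the text; treating the 4096 latent channels as 1024 channels spread over 4 depth slices, and fusing dual-view channels back to 1024 with the \(1\times 1\times 1\) convolution, are our assumptions chosen to match the generation subnetwork's \(4\times 4\times 4\times 1024\) input.

```python
import torch
import torch.nn as nn

class TransformationNet(nn.Module):
    """Reshapes 2D latents into a 3D latent Z and fuses views with a
    1x1x1 convolution, per Section D."""
    def __init__(self, dual_view: bool = False):
        super().__init__()
        in_ch = 2048 if dual_view else 1024
        self.fuse = nn.Conv3d(in_ch, 1024, kernel_size=1)  # the 1x1x1 convolution

    @staticmethod
    def _to_3d(latent):              # (N, 4096, 4, 4) -> (N, 1024, 4, 4, 4)
        return latent.view(latent.shape[0], 1024, 4, 4, 4)

    def forward(self, l1, l2=None):
        z = self._to_3d(l1)
        if l2 is not None:           # dual-view: concatenate along channels
            z = torch.cat([z, self._to_3d(l2)], dim=1)
        return self.fuse(z)          # learns the reshaped spatial hierarchies
```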
E. Generation Subnetwork
The generation subnetwork is the final component of the developed set of encoder-decoder neural networks and is tasked with enlarging the reshaped latent tensor \(Z\) into a final volumetric output. Considering the reshaped latent tensor \(Z\) and final output CT volume \({Y}_{pred}\), the generation subnetwork, abstracted as \(G\), is invoked such that \(G\left(Z\right)={Y}_{pred}\). The data flow of the hidden convolutional layers in the generation subnetwork is given as: \(4\times 4\times 4\times 1024 \to 8\times 8\times 8\times 512 \to 16\times 16\times 16\times 256 \to 32\times 32\times 32\times 128 \to 64\times 64\times 64\times 64 \to 128\times 128\times 128\times 32 \to 128\times 128\times 128\times 1\), with each '\(\to\)' representing a deconvolutional block with the ReLU activation function.
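A corresponding PyTorch sketch follows. Transposed-convolution kernel sizes are our assumptions, and the final single-channel projection is shown without an activation, since the text does not specify one for the output layer.

```python
import torch.nn as nn

def deconv_block(c_in, c_out):
    """One '->' in the data flow: transposed convolution + ReLU.
    Kernel 4, stride 2, padding 1 doubles each spatial dimension."""
    return nn.Sequential(
        nn.ConvTranspose3d(c_in, c_out, kernel_size=4, stride=2, padding=1),
        nn.ReLU(inplace=True),
    )

class GenerationNet(nn.Module):
    """Expands the reshaped latent Z (4x4x4x1024) into a 128^3 CT volume."""
    def __init__(self):
        super().__init__()
        self.decode = nn.Sequential(
            deconv_block(1024, 512),          # 8x8x8x512
            deconv_block(512, 256),           # 16x16x16x256
            deconv_block(256, 128),           # 32x32x32x128
            deconv_block(128, 64),            # 64x64x64x64
            deconv_block(64, 32),             # 128x128x128x32
            nn.Conv3d(32, 1, kernel_size=1),  # 128x128x128x1 output Y_pred
        )
    def forward(self, z):                     # z: (N, 1024, 4, 4, 4)
        return self.decode(z)
```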
F. Network Training
To determine the accuracy of both single-view and dual-view architectures, both neural networks are trained on the paired synthesized topogram and volumetric CT dataset. For both models, the Adam optimizer with an initial learning rate of 0.0002 is used to minimize the mean-squared-error (MSE) loss function \(L({Y}_{pred}, {Y}_{truth})\) via backpropagation. All training was conducted on two SLI-connected NVIDIA Tesla P100 GPUs, each with 16 GB of VRAM. Model weights are saved locally every 10 epochs, and the training process terminates automatically once the loss converges. The weights with the lowest average loss are preserved and serialized.
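A minimal training loop reflecting this setup is sketched below: Adam with an initial learning rate of 2e-4 minimizing MSE between predicted and ground-truth volumes, with checkpoints every 10 epochs and the lowest-average-loss weights retained. The epoch count, file names, and single-view batch handling are our assumptions.

```python
import torch
import torch.nn as nn

def train(model, loader, epochs=200, device="cuda"):
    """Train a topogram-to-CT model on paired 2D-3D data (single-view shown)."""
    model = model.to(device)
    opt = torch.optim.Adam(model.parameters(), lr=2e-4)  # initial LR 0.0002
    loss_fn = nn.MSELoss()                               # L(Y_pred, Y_truth)
    best = float("inf")
    for epoch in range(1, epochs + 1):
        running = 0.0
        for topograms, ct in loader:       # paired synthesized topogram / CT volume
            topograms, ct = topograms.to(device), ct.to(device)
            opt.zero_grad()
            loss = loss_fn(model(topograms), ct)
            loss.backward()                # backpropagation
            opt.step()
            running += loss.item()
        avg = running / len(loader)
        if epoch % 10 == 0:                # save weights every 10 epochs
            torch.save(model.state_dict(), f"weights_epoch{epoch:04d}.pt")
        if avg < best:                     # preserve lowest-average-loss weights
            best = avg
            torch.save(model.state_dict(), "weights_best.pt")
    return model
```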
G. Evaluation Metrics
For quantitative evaluation, four common image similarity metrics are calculated: mean-squared error (MSE), mean absolute error (MAE), structural similarity (SSIM), and peak signal-to-noise ratio (PSNR); a reference sketch of these computations is given at the end of this section. Additionally, the accuracy of lung/tissue segmentations obtained by thresholding the volumes produced by both neural networks and the original (ground-truth) CT volume is used as an application-focused metric, quantified with Dice score calculations28,29. Finally, we adopt a methodology to determine the quality of the recovered volumes as input for dose modulation techniques. For this, we first remove extraneous objects from the CT volume through thresholding and connected component labeling30. Next, we calculate the water-equivalent diameter (\({D}_{w}\)) of each slice of the ground-truth and estimated volumes by:
$${D}_{w}=2\sqrt{\frac{{A}_{w}}{\pi }},\quad \text{where}\quad {A}_{w}={A}_{pixel}\times \sum \left(\frac{\mu \left(x,y\right)}{{\mu }_{water}}\right)={A}_{pixel}\times \sum \left(\frac{CT\#\left(x,y\right)}{1000}\right) \qquad \left(2\right)$$
where \(CT\#\left(x,y\right)\) represents the water-normalized attenuation of the voxel at coordinates \((x, y)\), \({A}_{pixel}\) represents the area of a single pixel, and \({A}_{w}\) represents the water-equivalent area of the given CT slice. This method, together with its underlying approximations, was validated by Wang et al. using both analytical and Monte Carlo methods31. Similar calculations are also included in the AAPM Task Group 220 report14.
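A minimal NumPy transcription of Eq. (2) for a single axial slice follows; the function name is ours, and it assumes extraneous objects have already been removed by the thresholding and connected-component step described above.

```python
import numpy as np

def water_equivalent_diameter(ct_slice, pixel_area_mm2):
    """Compute D_w for one axial slice per Eq. (2). `ct_slice` holds the
    water-normalized attenuation values CT#(x, y); `pixel_area_mm2` is
    A_pixel. Returns D_w in mm."""
    a_w = pixel_area_mm2 * np.sum(ct_slice / 1000.0)  # water-equivalent area A_w
    return 2.0 * np.sqrt(a_w / np.pi)                 # D_w = 2 * sqrt(A_w / pi)
```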
In addition to these quantitative image similarity metrics, qualitative analysis was performed on typical generated samples.
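For reference, a minimal sketch of the quantitative metrics above is given below, using scikit-image for SSIM and PSNR. The normalized data range and the lung-segmentation threshold are illustrative placeholders; the exact threshold value is not specified here.

```python
import numpy as np
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

def evaluate(pred, truth, data_range=1.0, lung_thresh=0.2):
    """Compute MSE, MAE, SSIM, PSNR, and a threshold-based Dice score for
    one predicted/ground-truth volume pair (assumed normalized)."""
    mse = float(np.mean((pred - truth) ** 2))
    mae = float(np.mean(np.abs(pred - truth)))
    ssim = structural_similarity(pred, truth, data_range=data_range)
    psnr = peak_signal_noise_ratio(truth, pred, data_range=data_range)
    # Illustrative lung mask: lungs are low-attenuation regions.
    seg_p, seg_t = pred < lung_thresh, truth < lung_thresh
    denom = np.sum(seg_p) + np.sum(seg_t)
    dice = 2.0 * np.sum(seg_p & seg_t) / max(denom, 1)  # guard empty masks
    return {"MSE": mse, "MAE": mae, "SSIM": ssim, "PSNR": psnr, "Dice": dice}
```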