Towards Texture Accurate Slice Interpolation of Medical Images Using PixelMiner

Quantitative analysis models are used for various medical imaging tasks such as registration, classification, object detection, and segmentation, all of which benefit from high-quality imaging. We propose PixelMiner, a convolution-based deep learning model that interpolates computed tomography (CT) imaging slices while preserving quantitative imaging features better than standard interpolation methods. PixelMiner was designed to produce texture-accurate slice interpolations by trading off pixel accuracy for texture accuracy using a novel architecture. PixelMiner was trained on a large dataset of 7,829 lung CT scans and validated on external lung CT datasets. We demonstrated the model's effectiveness by computing the edge preservation ratio (EPR), comparing texture features commonly used in radiomics, and performing a qualitative assessment through a human trial. PixelMiner had the higher EPR 82% of the time (p<.01) compared with the closest competing method. It had the lowest texture error, with a normalized root mean squared error (NRMSE) of 0.11 (p<.01), and the highest proportion of reproducible features with a concordance correlation coefficient (CCC) ≥ 0.85 (p<.01). PixelMiner was chosen 72% of the time in human evaluation (p<.01). PixelMiner was thus not only demonstrated quantitatively to produce improved structural and textural reconstructions but also shown to be qualitatively preferable.


Introduction
Radiological images are an important prognostic tool used to inform medical scientists and medical professionals alike. Radiological imaging is primarily used through qualitative assessment by physicians. However, recent advancements in image acquisition, standardization, and analysis have enabled the discovery of quantitative biomarkers, for example through radiomics [1]. Quantitative image analysis provides the means not only to diagnose oncological disorders [2][3][4] but also to personalize treatment [5] without the use of invasive procedures.
Quantitative image analysis depends on the quality of the image. Image quality is determined by an image's contrast and spatial resolution, as well as the level of noise and the presence of artifacts. Image quality affects the ability to discern different structures and low-frequency signals within the image [6]. Low image quality may therefore have a negative effect on quantitative image analysis [7]. Computed tomography (CT) scans are acquired and reconstructed with a broad range of imaging-specific parameters that can affect image quality, including differences in reconstruction kernel [8] and slice spacing. Different CT reconstruction methods cause slice spacings to vary considerably, creating challenges for predictive analytics. Differences in pixel size and slice spacing have, for example, been shown to affect the stability, robustness, and repeatability of radiomic features [9]. Reproducing features faithfully contributes to better-performing object detection and classification models that rely on deep learning to make accurate predictions.
To overcome issues with heterogeneity, CT scans are often interpolated to a common slice spacing. Commonly used interpolation methods include nearest neighbors (NN), trilinear, and tricubic interpolation [10]. Medical scans can be thought of as samples of continuous latent images, implying there is still much uncaptured information between slices that can be predicted. Progressing through the slices, various structures move, grow, or rotate at different rates, which a model needs to anticipate. Common interpolation models cannot anticipate these structural changes, which causes artifacts such as blurring or ghosting to appear on the interpolated slices [11]. These artifacts have been shown to lead to errors in image registration [12]. Furthermore, high-frequency information can be lost during interpolation [13], and signal reproduction can be too low, which leads to distortions [14]. Texture information is represented in a wide range of spatial frequencies, which are very important for image classification [15]. These studies suggest it is important to use an algorithm capable of capturing frequency-signal information to generate accurate synthetic textures.
The aim of this research is to improve the quality of medical imaging using interpolation, allowing for more homogeneous datasets through preprocessing. The proposed method is a deep learning convolution-based architecture called PixelMiner. PixelMiner is an autoregressive model that predicts a slice between two adjacent slices by taking a three-dimensional input including the target slice, where the prediction loss is determined by the difference between the target slice and a single interpolated slice. We hypothesize that our approach has major advantages over other methods: because it is autoregressive, it is able to include information about the interpolated slice itself in its predictions, in contrast to other models that rely solely on surrounding slices. We also hypothesize that because textures can be well summarized by their component spatial frequencies [15], a model designed specifically for processing signals will achieve improved performance. PixelMiner was compared against existing interpolation methods on the accuracy of texture generation using standard metrics and through radiomics feature analysis.

Data
The dataset used for training is a publicly available dataset from the Radiological Society of North America (RSNA) Pulmonary Embolism Detection Challenge © RSNA 2020 [16]. This dataset consists of 7,829 chest CT scans, with pixel spacing and slice spacing of 1mm and 1.25mm, respectively. The dataset's terms and conditions allow the dataset to be shared or re-distributed in any form. For more information see https://www.rsna.org/education/ai-resources-and-training/ai-image-challenge/rsna-pe-detection-challenge-2020. Five institutions contributed data: AlfredHealth, Koç University Hospital, the Center for Artificial Intelligence in Medicine & Imaging (AIMI) at Stanford, Unity Health Toronto, and Universidade Federal de São Paulo. All CT scans were retrospectively collected and anonymized at each institution and approved by the respective institutional review boards in accordance with relevant guidelines and regulations.
The dataset used for validation is the publicly available dataset from the Early Lung Cancer Action Program (ELCAP) Public Lung Image Database [17]. This database consists of 50 low-dose CT scans, with a slice spacing of 1.25mm. The dataset's terms and conditions allow use for non-commercial purposes only, including academic research and education. All CT scans were collected by ELCAP in collaboration with Weill Cornell Medical College and approved by its institutional review board in accordance with relevant guidelines and regulations.

The proposed model, PixelMiner, is a causal autoregressive model based on PixelCNN++ [18,19]. PixelMiner includes many of the features of PixelCNN++, including vertically and horizontally stacked masked convolutions, gated convolutions, and the logistic mixture likelihood loss. The model was trained using batches of three sequential slices as input to generate a single output slice. The middle input slice was used to evaluate the output slice, so that the model learns to replicate it. Learning to replicate slices of high-resolution scans in this way can then be exploited at generation time to interpolate slices in scans with larger slice spacings.
PixelCNN-based models were designed to generate RGB images and cannot be directly adapted to perform slice interpolation due to stability issues, such as warping and artifacts. Their main constraint for slice interpolation is the use of 2D masked convolutions. The PixelMiner architecture instead uses a custom mixed causal and non-causal convolutional block. PixelMiner also uses a combination of 2D and 3D convolutions, where the 3D convolutions are used to learn the relationships between the slices. This architecture creates a pathway for non-masked, non-causal filters in the outer channels using only the external slices, while using masked causal filters in the middle. The causal channel receives information from both the middle and external slices. This allows the outer slices to provide complete information to the middle and prevents the 2D convolutions' tendency to ignore the outer slices. A schematic overview of this convolutional block is provided in figure 1A.
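The mixed masking scheme can be illustrated with a minimal sketch. This is an illustration of the concept only, not the trained implementation; the kernel size, mask layout, and function names are assumptions based on the description above:

```python
import numpy as np

def causal_mask_2d(k):
    """PixelCNN-style causal mask for a k x k kernel: the filter may see
    rows above the center and pixels left of the center in its own row,
    but never the (not yet generated) center pixel itself."""
    mask = np.zeros((k, k), dtype=bool)
    c = k // 2
    mask[:c, :] = True        # all rows above the center row
    mask[c, :c] = True        # pixels left of the center
    return mask

def mixed_mask_3d(k):
    """Sketch of the mixed block for a 3-slice input: the outer slices
    (top and bottom) are fully visible (non-causal), while the middle
    slice -- the one being generated -- is causally masked.
    Returns a (depth=3, k, k) boolean visibility mask."""
    mask = np.ones((3, k, k), dtype=bool)   # outer slices unmasked
    mask[1] = causal_mask_2d(k)             # middle slice causal
    return mask

m = mixed_mask_3d(3)
assert not m[1, 1, 1]                 # center pixel of middle slice hidden
assert m[1, 0].all() and m[1, 1, 0]   # rows above and left neighbor visible
assert m[0].all() and m[2].all()      # outer slices fully visible
```

Multiplying a convolution kernel elementwise by such a mask before each forward pass is the standard way masked convolutions are realized in PixelCNN-family models.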
PixelMiner further utilizes residual blocks, each of which consists of six downsampling and upsampling layers with long-range skip connections. Feature maps are downsampled using stride-2 PixelMiner convolutions from 64x64 to 32x32 to 16x16, and then upsampled again using transposed convolutions. The final model consists of an input convolutional layer, N residual blocks, and an output convolutional layer using an identity convolution to combine the final output. A schematic depiction of the architecture can be found in figure 1B.
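The 64 → 32 → 16 → 32 → 64 shape schedule follows from the standard convolution output-size formulas. A small sketch, assuming kernel size 3, padding 1, and output padding 1 (none of which are stated in the paper):

```python
def conv_out(n, k=3, s=2, p=1):
    """Output size of a strided convolution: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * p - k) // s + 1

def tconv_out(n, k=3, s=2, p=1, op=1):
    """Output size of a transposed convolution: (n - 1)s - 2p + k + op."""
    return (n - 1) * s - 2 * p + k + op

# Two stride-2 downsamples: 64 -> 32 -> 16,
# then two transposed-conv upsamples back: 16 -> 32 -> 64.
sizes = [64]
for _ in range(2):
    sizes.append(conv_out(sizes[-1]))
for _ in range(2):
    sizes.append(tconv_out(sizes[-1]))
assert sizes == [64, 32, 16, 32, 64]
```

The skip connections between matching resolutions are what make the down/up path residual.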

Generation
The model upsamples a scan using two slices at a time. To begin, the first two sequential slices are selected: a top slice and a bottom slice. An empty slice is then inserted between the top and bottom slices. These three slices are then provided to the model. The model generates an interpolated slice pixel by pixel until the entire slice is complete, and repeats the process until every slice of the scan has been interpolated. Figure 2 provides a graphic example of the completed upsampling technique.
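The upsampling loop described above can be sketched as follows. Here a simple midpoint average stands in for the network, and the function names are illustrative, not the paper's implementation:

```python
import numpy as np

def interpolate_scan(scan, model):
    """Double the through-plane resolution of a (slices, H, W) scan by
    generating one new slice between every adjacent pair. `model` is a
    stand-in for PixelMiner: it maps (top, bottom) -> interpolated slice."""
    n, h, w = scan.shape
    out = np.empty((2 * n - 1, h, w), dtype=scan.dtype)
    out[::2] = scan                           # original slices keep even indices
    for i in range(n - 1):
        out[2 * i + 1] = model(scan[i], scan[i + 1])
    return out

# Placeholder "model": midpoint averaging stands in for the network.
avg = lambda a, b: (a + b) / 2.0
scan = np.arange(9, dtype=float).reshape(9, 1, 1) * np.ones((9, 2, 2))
up = interpolate_scan(scan, avg)
assert up.shape == (17, 2, 2)                 # 9 slices -> 17 slices
assert np.allclose(up[1], 0.5)                # between slices 0 and 1
```

This matches the figure 2 example, where a nine-slice scan gains eight interpolated slices.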
With the resolution of a single slice being 512 x 512 pixels, it is not feasible to process entire slices due to memory limitations. However, because PixelMiner is an autoregressive model, it is capable of generating slices in patches. Therefore, patches of 64 x 64 were used for training and slice interpolation. Patches are generated beginning at the top left corner and then expanding out from the initial patch. Autoregressive models require enough observations to capture a signal to make optimal predictions [20]. With this in mind, PixelMiner uses patches of 64 x 64 with incomplete 32 x 32 sub-patches. An example of this process is given in figure 3.
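One way to realize this traversal is sketched below: the slice is filled in 32 x 32 tiles, each generated inside a 64 x 64 context window whose bottom-right quadrant is the new tile. The exact ordering and the clipping at the top-left border are assumptions based on figure 3:

```python
def patch_schedule(size=512, patch=64, sub=32):
    """Sketch of a patched generation order: fill the slice in sub x sub
    tiles, each generated within a patch x patch context window whose
    bottom-right quadrant is the new tile (clipped at the top-left border,
    where less prior context exists)."""
    steps = []
    for y in range(0, size, sub):
        for x in range(0, size, sub):
            ctx = (max(0, y + sub - patch), max(0, x + sub - patch))
            steps.append({"context_origin": ctx, "tile_origin": (y, x)})
    return steps

steps = patch_schedule()
assert len(steps) == (512 // 32) ** 2             # 256 tiles per slice
assert steps[0]["context_origin"] == (0, 0)       # first tile: no prior context
assert steps[-1]["tile_origin"] == (480, 480)
assert steps[-1]["context_origin"] == (448, 448)  # full 64 x 64 context
```

Because generation is raster-ordered, every 32 x 32 tile (after the first row and column) sees a half-generated 64 x 64 context, matching the "incomplete sub-patches" described above.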

Training and Validation
Training is performed in an unsupervised fashion with the objective of reproducing the middle input slice, without the need for labels. The model receives three randomly selected slices in sequence and outputs a single slice, which is then used together with the middle input slice to calculate the loss. Figure 4 provides an overview of the model training, testing, and validation steps. At generation time, the interpolated synthetic slices and ground truth slices were compared for evaluation. In addition, synthetic slices were generated using other interpolation methods, including (tri-)linear [21,22], (tri-)cubic [23,24], windowed sinc (WS) [25], and NN [26] interpolation. Linear and BSpline (cubic) interpolation are methods commonly used in radiomic studies [27]. Windowed sinc and NN interpolation were performed using the open-source Insight Toolkit (ITK) [28], with tools designed specifically for medical imaging. Figure 5 provides examples of interpolated slices.

Edge Preservation Ratio
The edge preservation ratio (EPR) was used to evaluate the accuracy and sharpness of the shapes produced by all interpolation methods. EPR has been shown to outperform the common metrics peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) in both Gaussian blurring tests and human perception [29]. EPR uses a Canny edge detector to produce binary edge maps of an image and its reference. It then takes the overlap of edge pixels between the two images and divides by the total number of edge pixels in the reference image. The EPR of each slice was determined and compared across interpolation methods. EPR results were statistically tested using a binomial test on the frequency of higher ratios and a Wilcoxon signed-rank test on the paired ratios of competing methods on the test dataset.
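Given binary edge maps (e.g. from a Canny detector), the ratio itself is a one-liner. A minimal sketch, assuming the common overlap-over-reference normalization; the edge maps here are toy arrays, not real detector output:

```python
import numpy as np

def edge_preservation_ratio(ref_edges, test_edges):
    """EPR sketch: fraction of reference edge pixels that are also
    detected as edges in the test (interpolated) image."""
    ref = np.asarray(ref_edges, dtype=bool)
    test = np.asarray(test_edges, dtype=bool)
    overlap = np.logical_and(ref, test).sum()
    return overlap / ref.sum()

ref = np.zeros((4, 4), dtype=bool)
ref[1, :] = True                  # a horizontal edge of 4 pixels
test = ref.copy()
test[1, 3] = False                # one edge pixel lost by interpolation
assert edge_preservation_ratio(ref, test) == 0.75
```

An EPR of 1.0 means every reference edge survived interpolation; blurring methods lose edge pixels and score lower.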

Texture Features
To test the interpolation methods' ability to retain texture information, gray-level co-occurrence matrix (GLCM), gray-level run length matrix (GLRLM), and gray-level size zone matrix (GLSZM) texture features were extracted. Feature extraction was performed in Python 3.7 with Pyradiomics 3.0.1 [30]. Features were extracted from the synthetic slices and from the ground truth slices, which act as the reference. Masks of regions of interest (ROI) were used for feature extraction to avoid subjectivity. A single image mask was created for the area around the lungs. A simple automated segmentation algorithm was created using thresholding and binary morphology operations to dilate, erode, or close openings in the mask using the morphology module in the scikit-image package [31]. Lungs were segmented using a threshold of < -320 Hounsfield units. Using the segmented ROI, features were extracted and compared between synthetic slices and the ground truth using the NRMSE and the concordance correlation coefficient (CCC). The CCC measures the concordance between paired ground truth and test values, quantifying agreement and reproducibility:

CCC = 2ρσ_x σ_y / (σ_x² + σ_y² + (μ_x − μ_y)²),

where μ_y and μ_x are the means and σ²_y and σ²_x are the variances of the reference and tested slice features, respectively, and ρ is the Pearson correlation coefficient.
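The CCC is straightforward to compute directly. A minimal sketch using population moments, as in Lin's original definition (the function name is illustrative):

```python
import numpy as np

def ccc(x, y):
    """Concordance correlation coefficient between test feature values x
    and reference feature values y: 2*cov / (var_x + var_y + (mean gap)^2).
    Equals 1 only for perfect agreement (identity line)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return 2 * cov / (vx + vy + (mx - my) ** 2)

assert np.isclose(ccc([1, 2, 3], [1, 2, 3]), 1.0)   # perfect agreement
assert ccc([1, 2, 3], [2, 3, 4]) < 1.0              # constant shift penalized
```

Unlike the Pearson correlation, the CCC penalizes systematic bias: the shifted series above correlates perfectly but scores only 4/7.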
The mean squared mapped feature error (MSMFE) is a newly proposed metric. For each texture feature, a feature map was generated using a sliding window with a size of 5x5 and a step of 1. The RMSE is calculated between the ground truth reference feature map and the feature map of the evaluated scan. The final error is the NRMSE averaged over all texture features. To test for statistical significance, the Wilcoxon signed-rank test was used on all paired results. Furthermore, the lowest error rates were tracked and counted for comparison to PixelMiner, which was tested using a binomial test.
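The feature-map construction can be sketched as follows. Local variance stands in here for a real Pyradiomics texture feature, and the range-based normalization is an assumption; this is an illustration of the mechanism, not the paper's exact implementation:

```python
import numpy as np

def feature_map(img, feature, win=5, step=1):
    """Slide a win x win window over `img` (step `step`) and evaluate
    `feature` on each patch, producing a 2D feature map (valid positions)."""
    h, w = img.shape
    rows = range(0, h - win + 1, step)
    cols = range(0, w - win + 1, step)
    return np.array([[feature(img[r:r + win, c:c + win]) for c in cols]
                     for r in rows])

def nrmse(ref, test):
    """RMSE normalized by the reference map's dynamic range."""
    rng = ref.max() - ref.min()
    return np.sqrt(np.mean((ref - test) ** 2)) / rng

rng_ = np.random.default_rng(0)
truth = rng_.random((16, 16))
synthetic = truth + 0.01 * rng_.standard_normal((16, 16))
# np.var stands in for a Pyradiomics texture feature (e.g. a GLCM statistic).
err = nrmse(feature_map(truth, np.var), feature_map(synthetic, np.var))
assert 0 <= err < 1
```

Averaging such per-feature NRMSE values over all extracted texture features yields the final MSMFE-style error.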

Qualitative Assessment: an In Silico Clinical trial
For qualitative assessment of the generated slices, we set up an in silico clinical trial. The trial was performed using a newly created application that randomly presents a test subject with three slices, of which one is the ground truth and the other two are corresponding synthetically generated slices. Of the two interpolated slices, one was generated by the PixelMiner model and one by another randomly chosen interpolation method. Users were given 12 randomly selected slices from 50 different scans, which included each competing interpolation method at least once. For each slice, the users had to choose which synthetic slice better represented the ground truth slice. A binomial test was performed to test statistical significance.

Table 1 provides an overview of the EPR values for each tested interpolation method. PixelMiner had higher EPR values for 82% (p < .01) of the scans in the validation dataset compared to the next best performing interpolation method, windowed sinc. In addition, PixelMiner had the highest mean EPR at 0.489 compared to the other methods (p < .01).

Segmentation Extracted Texture Feature Outcomes
PixelMiner had the lowest average NRMSE at 0.109 (p < .01). PixelMiner had a lower NRMSE value for 62.2% (p < .01) of the scans in the validation dataset compared to the next best performing interpolation method, windowed sinc. PixelMiner also had the highest proportion of reproducible features (74.5% with CCC ≥ 0.85) and the highest mean CCC at 89.8. An overview of these metrics for the tested interpolation methods can be found in Table 2. Table 3 provides an overview of the MSMFE for each texture feature for the different interpolation methods. PixelMiner had a lower average MSMFE compared to the other interpolation methods at 0.619 (p < .01). For 68.19% of the features, PixelMiner had the lowest MSMFE compared with the next best interpolation method (p < .01).

Qualitative Assessment
The application was presented to 16 participants. As some participants did not grade all 12 presented slices, a total of 140 comparisons were collected. The participants, who included medical professionals, radiologists, radiation oncologists, computer scientists, and students, were tracked during the trial using a custom-built comparison application. PixelMiner was chosen on average 72% (p < .01) of the time, and no other model was chosen as frequently. PixelMiner was chosen over the nearest neighbor, BSpline, windowed sinc, and linear methods 68.2%, 70%, 75%, and 96.3% of the time (p < .05), respectively.
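The binomial test underlying these preference counts can be sketched with standard-library arithmetic. The count of 101 out of 140 below is an illustrative figure consistent with the reported 72%, not a number taken from the study:

```python
from math import comb

def binomial_p_one_sided(k, n, p=0.5):
    """One-sided binomial test sketch: probability of observing k or more
    successes out of n trials under a null success rate p (chance)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# If PixelMiner were preferred in, say, 101 of 140 pairwise comparisons
# (roughly 72%), chance-level preference (p = 0.5) is implausible:
p_val = binomial_p_one_sided(101, 140)
assert p_val < 0.01
```

In practice one would use a library routine (e.g. a two-sided exact test), but the exact tail sum above conveys the idea.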

Discussion
The goal of this study was to create a model that can perform slice interpolation with a high degree of accuracy compared to other interpolation methods. We have shown that PixelMiner outperforms the linear, NN, BSpline, and windowed sinc interpolation methods on every metric used to evaluate them. PixelMiner has been shown to have, on average, lower error, higher reproducibility, and higher qualitative performance.
PixelMiner had a significantly lower NRMSE compared to all other models, and the lowest NRMSE for 65.2% of the features compared to the next best method. Furthermore, the reproducibility of PixelMiner using a CCC ≥ 0.85 was shown to be 80%, with a mean of 91.42, greater than all other models. Moreover, the average edge preservation ratio of 0.489 was found to be significantly higher than that of the other tested methods. If the features for a particular radiomics signature are known, it would be worth determining beforehand which interpolation method best complements the model being developed. If the signature is unknown, or a practitioner is using deep learning, the results indicate PixelMiner to be the best choice for slice interpolation.
The approach is quite novel, and it is difficult to compare to similar studies. PixelCNNs have been used to perform super resolution on 2D images [32], but to the best of our knowledge they have never been used on 3D images, and this is the first time the approach has been used to generate high-fidelity images. Deep learning based super resolution has been an active area of research and includes many models such as SRGAN [33], MFTV [34], FSRCNN [35], SRResNet-V54 [36], LapSRN [37], and multiple dense residual block based GANs [38]. These types of models are designed for 3D images using grid-based upsampling with transposed convolutions. They cannot be compared directly with PixelMiner, since they would require downsampling in three dimensions to be evaluated, causing the models to lose valuable information that slice interpolation is able to retain. These super resolution models could also potentially be used in conjunction with PixelMiner, by upsampling after slices have already been interpolated.
Quantitative analysis in medical imaging relies on high-quality images that are harmonized across many types of scanners. Quantitative analysis of medical imaging has the potential to have a profound effect on prognosis, but it requires high-quality data for training machine learning models. Slice spacing is only one factor in the overall quality of medical images, but it affects many different areas of quantitative analysis, such as image registration, detection, segmentation, and classification.
A disadvantage of using PixelMiner is that it is slower than other interpolation methods. Using a single 1080 Ti graphical processing unit (GPU), it could take a full 24 hours to interpolate all of the slices of a high-resolution scan with 2mm slice spacing. Future improvements could be made through better parallelization across GPUs at generation time, but also through more efficient models using methods similar to MobileNet [39] or EfficientNet [40]. Furthermore, GPU performance continues to grow rapidly, and modern GPUs could bring down generation times considerably.
PixelCNN++ is known to struggle with long-range dependence, which makes pixels that are farther apart more difficult to predict. This could be partially alleviated by using self-attention, which showed modest improvements in the PixelSNAIL model [41]. Self-attention was omitted from the current version of PixelMiner due to time constraints. With the development of the attention-based transformer model [42], it could be possible to build an improved version of PixelMiner similar to the transformer-based image generation model ImageGPT [43]. Transformers not only run faster than RNNs but also address the issues with long-range dependencies in PixelCNN and PixelRNN. Transformer-based image models are a relatively new development in computer vision, but they have the potential to push the capabilities of PixelMiner much further.

Conclusion
PixelMiner provides a new state-of-the-art method for performing slice interpolation on medical imaging, capable of improving imaging resolution while better preserving texture features, an important set of features in quantitative medical imaging analysis. PixelMiner performs well not only on the training dataset but also on the external validation datasets.

Figure 1
Diagram A is a schematic depiction of the PixelMiner convolutional block. Diagram B is the entire architecture, where the input goes through a 2D convolution and the output goes through an identity convolution. The architecture is made up of N residual blocks with two downsamples using stride-2 convolutions and two upsamples using transposed convolutions.

Figure 2
A graphic representation of the interpolation technique. In this example, a scan with nine slices with large slice spacings is upsampled with eight additional slices. The dots represent the center of each slice, the dotted lines represent new borders following interpolation, and the arrows represent the generated slices being inserted.

Figure 3
An example of an interpolated slice being generated in patches. The region contained in the red box indicates the next patch of 64 x 64 to be fed into the model. The yellow patch of 32 x 32 in the bottom right of the red box is the sub-patch of pixels that are to be generated next.

Figure 4
Overview of the model training, testing, and validation steps. As the model is unsupervised, there is no need for labels. It is trained with three inputs, where the middle slice is what the model learns to interpolate. At test time the model receives two inputs and produces a single output, where the output is compared to the ground truth. The production model takes in two inputs and outputs the interpolated slice. The output slices are combined with the original slices to create the final upsampled scan.

Supplementary Files