## Unsupervised learning for generating VMIs

Based on the assumption that the linear attenuation coefficients at low and high energies can be expressed as a linear combination of the effective mass attenuation coefficients of two basis materials [23, 24], a VMI obtained from image-space data is a linear combination of the DECT images [8], which can be written as follows:

$$\mathrm{VMI}(E) = w(E) \times \mathrm{CT}^{L} + \left(1 - w(E)\right) \times \mathrm{CT}^{H} \tag{1}$$

where E is an energy level (keV), w(E) is an energy-dependent weighting factor, CTL is the low-kV CT image, and CTH is the high-kV CT image. Because w(E) is larger than 1 at low energy levels (E < 60 keV), the noise in CTL is amplified, so low-keV VMIs often suffer from severe noise. In contrast, w(E) is small at high energy levels (E > 100 keV), so the image quality of high-keV VMIs is similar to that of CTH. Although the VMI+ technique, which mixes VMI(E) with the 70-keV VMI, can improve VMI quality at each energy level [9], the improvement is limited at high energy levels.
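As a sketch of Eq. (1), the blend can be computed per pixel; the weight 1.6 below is a hypothetical w(E) for a low energy level (the calibrated w(E) values depend on the scanner), chosen only to illustrate how w > 1 amplifies noise from the low-kV image:

```python
import numpy as np

def vmi(ct_low, ct_high, w):
    # Eq. (1): VMI(E) = w(E) * CT^L + (1 - w(E)) * CT^H
    return w * ct_low + (1.0 - w) * ct_high

# Toy 2x2 "images" in HU; w = 1.6 stands in for w(E) at a low keV,
# where the weight on CT^L exceeds 1 and its noise is amplified.
ct_l = np.array([[100.0, 50.0], [30.0, 0.0]])
ct_h = np.array([[60.0, 40.0], [20.0, 0.0]])
low_kev = vmi(ct_l, ct_h, 1.6)
```

Because the weight on CTH is negative (1 − 1.6 = −0.6), independent noise in the two inputs adds in quadrature rather than cancelling, which is why low-keV VMIs are the noisiest.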

To further improve the quality of DECT-based VMIs, we propose an unsupervised-learning-based method to generate VMIs from DECT. Based on the concept of DIP [19], the VMI at energy level E can be generated by training a neural network that maps the DECT images to VMI(E). This relation can be described as follows:

$$\mathrm{VMI}(E) = f_{\theta}(\mathrm{CT}^{L}, \mathrm{CT}^{H}) \tag{2}$$

where \(f_{\theta}\) is a convolutional neural network (CNN) with parameters \(\theta\). Both CTL and CTH are model inputs. Because of the strong representational power of CNNs, it is possible to generate more than one VMI(E) at once. We therefore design a CNN model that generates three VMIs at different keV levels, so Eq. (2) can be rewritten as follows:

$$\{\mathrm{VMI}(E_{1}), \mathrm{VMI}(E_{2}), \mathrm{VMI}(E_{3})\} = f_{\theta}(\mathrm{CT}^{L}, \mathrm{CT}^{H}) \tag{3}$$

As shown in Fig. 1, the measured DECT images are fed into a U-Net model [25] that outputs three VMIs at different keV levels. To achieve this in an unsupervised manner, each pair of predicted VMIs is used to re-calculate the DECT images based on Eq. (1), yielding three paired DL-derived DECT imaging sets. By minimizing the differences between the measured and DL-derived DECT images, the U-Net model is constrained to generate three different-keV VMIs directly from the measured DECT images. The loss function can be described as follows:

$$\theta^{*} = \underset{\theta}{\operatorname{argmin}} \; \left\| g\left(\{\mathrm{VMI}(E_{1}), \mathrm{VMI}(E_{2})\}\right) - \{\mathrm{CT}^{L}, \mathrm{CT}^{H}\} \right\|_{2}^{2} + \left\| g\left(\{\mathrm{VMI}(E_{2}), \mathrm{VMI}(E_{3})\}\right) - \{\mathrm{CT}^{L}, \mathrm{CT}^{H}\} \right\|_{2}^{2} + \left\| g\left(\{\mathrm{VMI}(E_{1}), \mathrm{VMI}(E_{3})\}\right) - \{\mathrm{CT}^{L}, \mathrm{CT}^{H}\} \right\|_{2}^{2} \tag{4}$$

where g is a custom function that solves Eq. (1) given two VMIs. With each model-predicted VMI pair (i.e., VMI(E1) and VMI(E2), VMI(E1) and VMI(E3), and VMI(E2) and VMI(E3)) and the known w(E), we can solve Eq. (1) and obtain DECT images. Note that the three paired DL-derived DECT imaging sets are derived from the three model-predicted VMI pairs. In this study, we selected three VMIs spanning low to high energy levels: E1, E2, and E3 were set to 40 keV, 70 keV, and 100 keV, respectively. Mapping the measured DECT images to more than three VMIs is possible; however, generating more VMIs requires more network parameters, which makes the model both harder and slower to train. Based on our preliminary tests, it is reasonable to use one CNN model to simultaneously generate three different-keV VMIs.
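The inversion performed by g can be written in closed form: subtracting Eq. (1) at two energies eliminates one unknown, leaving a per-pixel two-equation solve. A minimal sketch (the weights below are hypothetical placeholders, not the calibrated w(E) values; the paper does not specify g's implementation):

```python
import numpy as np

def g(vmi1, vmi2, w1, w2):
    # Invert Eq. (1) for a VMI pair with known weights w1 = w(E1), w2 = w(E2).
    # Subtracting the two instances of Eq. (1) gives:
    #   VMI(E1) - VMI(E2) = (w1 - w2) * (CT^L - CT^H)
    diff = (vmi1 - vmi2) / (w1 - w2)   # CT^L - CT^H (requires w1 != w2)
    ct_high = vmi1 - w1 * diff         # back-substitute into Eq. (1)
    ct_low = ct_high + diff
    return ct_low, ct_high

# Round trip: blend with hypothetical weights, then recover CT^L and CT^H.
w1, w2 = 1.6, 0.3
ct_l = np.array([[100.0, 30.0]])
ct_h = np.array([[60.0, 20.0]])
vmi1 = w1 * ct_l + (1 - w1) * ct_h
vmi2 = w2 * ct_l + (1 - w2) * ct_h
rec_l, rec_h = g(vmi1, vmi2, w1, w2)
```

Each of the three VMI pairs in Eq. (4) passes through this inversion, producing the three DL-derived DECT imaging sets that are compared against the measured images.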

As described above, one CNN model generates three different-keV VMIs, so more than one CNN model would be needed to cover a wide energy range. For example, four CNN models would be required to generate twelve VMIs over 40–150 keV at a 10-keV interval, which would be time-consuming when many CT slices must be processed. Instead, the other nine VMIs can be obtained from one learned CNN model: the average of its three DL-derived DECT imaging sets is combined with Eq. (1) and the known w(E) to calculate the remaining VMIs. Because this average should have better image quality than the measured DECT images, the quality of the nine calculated VMIs should also be improved. The proposed DL-based method can therefore generate a wide range of different-energy VMIs from one learned CNN model.
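This averaging-and-resynthesis step can be sketched as follows; each DL-derived set contributes one (CTL, CTH) pair, and Eq. (1) then yields a VMI at any energy whose w(E) is known (the weights below are again hypothetical):

```python
import numpy as np

def vmis_from_average(dect_sets, weights):
    # dect_sets: list of (CT^L, CT^H) pairs, one per DL-derived imaging set.
    # Averaging the sets suppresses residual noise before resynthesis.
    ct_l = np.mean([s[0] for s in dect_sets], axis=0)
    ct_h = np.mean([s[1] for s in dect_sets], axis=0)
    # Apply Eq. (1) once per requested energy level.
    return [w * ct_l + (1.0 - w) * ct_h for w in weights]

# Three (identical, for illustration) DL-derived DECT sets and two
# hypothetical w(E) values for additional energy levels.
sets = [(np.full((2, 2), 100.0), np.full((2, 2), 60.0))] * 3
extra_vmis = vmis_from_average(sets, [1.2, 0.5])
```

In practice the three sets differ slightly (they come from different VMI pairs), so the mean acts as a cheap ensemble over one trained model's outputs.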

In this study, the U-Net model shown in Fig. 2 was trained using the mean squared error (MSE) loss function. We used the adaptive moment estimation (Adam) algorithm with a learning rate of 1e-4 and the default parameters (beta1 = 0.9, beta2 = 0.999, and epsilon = 1e-8) to minimize the MSE loss. The number of epochs was set to 2000, and the batch size was set to 1. The U-Net model was implemented in PyTorch, and training was run on a computer with an NVIDIA Titan Xp GPU. DECT images were normalized to values between 0 and 1 before training. For qualitative and quantitative comparison, all DECT-based and DL-based VMIs were multiplied by 4095, and 1024 was then subtracted from the results, giving CT numbers in the range of −1024 to 3071 Hounsfield units (HU).
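The normalization before training and the final rescaling back to HU are exact inverses of each other; a minimal sketch of the mapping described above:

```python
def normalize(hu):
    # Map CT numbers in [-1024, 3071] HU to [0, 1] before training.
    return (hu + 1024.0) / 4095.0

def to_hu(x):
    # Invert the normalization: multiply by 4095, then subtract 1024.
    return x * 4095.0 - 1024.0
```

Keeping both directions as a matched pair avoids off-by-one scaling errors when comparing network outputs against the original HU-valued images.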