3.1 Datasets
In our experiments, we use three publicly available datasets, LiTS17, 3DIRCADb, and SLiver07, acquired by a wide variety of CT scanners from different vendors (the datasets are publicly available at https://competitions.codalab.org/competitions/17094, https://www.ircad.fr/research/3d-ircadb-01/, and https://sliver07.grand-challenge.org/, respectively). Their specifications are summarized in Table 1. All experiments were performed in accordance with relevant guidelines and regulations.
Table 1
The specifications of the experimental datasets ("-" means none)
| Datasets | Training | Test | Size | In-plane resolution | Inter-slice resolution | Slice num |
| --- | --- | --- | --- | --- | --- | --- |
| LiTS17 | 130 | 70 | 512 × 512 | 0.55–1.0 mm | 0.45–6.0 mm | 42–1026 |
| 3DIRCADb | 20 | - | 512 × 512 | 0.56–0.81 mm | 1.25–4 mm | 74–225 |
| SLiver07 | 20 | 10 | 512 × 512 | 0.56–0.8 mm | 1–3 mm | 64–394 |
Because LiTS17 and SLiver07 do not provide gold standards for their test sets, we randomly divided the LiTS17 and SLiver07 training datasets into new training and test sets at 116/15 and 10/10, respectively. In addition, since 3DIRCADb does not provide a test set, its 20 volumes with gold standards were randomly divided into training and test sets at 10/10.
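For clarity, the re-splitting procedure can be sketched as follows; the fixed seed and helper name are our own illustrative choices, not details specified in the paper.

```python
# Hypothetical sketch of the random train/test re-split; the seed is an assumption.
import random

def split_cases(case_ids, n_train, seed=0):
    """Shuffle case IDs reproducibly, then split off the first n_train."""
    ids = list(case_ids)
    random.Random(seed).shuffle(ids)
    return ids[:n_train], ids[n_train:]

# e.g., 3DIRCADb and SLiver07: 20 annotated volumes split 10/10
train_ids, test_ids = split_cases(range(20), n_train=10)
```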
3.2 Settings
During training, we set the total number of training epochs to 800 and the batch size to 1. The initial learning rate is 0.001 and is adjusted at preset milestone epochs, here 350 and 650, chosen according to the overall epoch budget. The learning rate is updated according to \(lr=initial\_lr\times \gamma\); that is, when training reaches epochs 350 and 650, the learning rate begins to decay, where the initial value of γ is 0.1. We use the standard Adam optimizer for the objective function. All experiments were run on a PC with Ubuntu 18.04, a single Intel Xeon Silver 4110 CPU, an RTX 2080 Ti GPU, and 64 GB RAM, using PyTorch 1.4 as the deep learning framework.
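Reading the milestone counts above as epochs, this schedule corresponds to a standard step decay. A minimal PyTorch sketch is given below, with a one-layer stand-in model and random data in place of the actual 3D DA-UNet and CT volumes:

```python
import torch
from torch import nn

model = nn.Conv3d(1, 1, kernel_size=3, padding=1)   # stand-in for 3D DA-UNet
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[350, 650], gamma=0.1)    # decay at epochs 350, 650
loss_fn = nn.BCEWithLogitsLoss()

for epoch in range(800):                            # 800 epochs in total
    x = torch.randn(1, 1, 16, 64, 64)               # batch size 1, dummy volume
    y = torch.randint(0, 2, x.shape).float()        # dummy binary mask
    optimizer.zero_grad()
    loss_fn(model(x), y).backward()
    optimizer.step()
    scheduler.step()                                # lr: 1e-3 -> 1e-4 -> 1e-5
```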
Figure 2 shows the training and validation curves of the proposed 3D DA-UNet for different values of α and β in the Tversky loss function. As the figures show, when α and β take 0.4/0.6 (red), the loss fluctuates only slightly during training and validation, and the initial Dice value is the highest, indicating that this setting is the most conducive to avoiding the gradient vanishing/explosion problem. We therefore empirically set the hyperparameters to α = 0.4 and β = 0.6 in this paper.
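For reference, a minimal sketch of the Tversky loss with these hyperparameters is given below, assuming sigmoid probabilities and a binary mask; the paper's exact formulation (e.g., smoothing constants) may differ.

```python
import torch

def tversky_loss(pred, target, alpha=0.4, beta=0.6, eps=1e-6):
    """pred: sigmoid probabilities; target: binary mask of the same shape."""
    pred, target = pred.reshape(-1), target.reshape(-1)
    tp = (pred * target).sum()          # true positives
    fp = (pred * (1 - target)).sum()    # false positives, weighted by alpha
    fn = ((1 - pred) * target).sum()    # false negatives, weighted by beta
    return 1 - (tp + eps) / (tp + alpha * fp + beta * fn + eps)
```

With β > α, false negatives are penalized more heavily than false positives, biasing the model toward recall of the liver region.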
To improve computational efficiency, we preprocess the datasets. First, we downsample each input volume to 256 × 256 in-plane. Second, we locate the first and last slices of the liver region and expand the range outward by 20 slices. Finally, to exclude irrelevant organs, we clip Hounsfield intensities to [−200, 200], resample the z-axis spacing of all data to 1 mm, and normalize the intensities to [0, 1].
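A sketch of this pipeline is given below; the exact step order, interpolation settings, and helper name are our assumptions rather than the paper's implementation.

```python
import numpy as np
from scipy.ndimage import zoom

def preprocess(volume, mask, spacing_z, margin=20):
    """volume: (D, H, W) CT in Hounsfield units; mask: liver ground truth."""
    # 1) Resample the z-axis to 1 mm spacing.
    volume = zoom(volume, (spacing_z / 1.0, 1.0, 1.0), order=1)
    mask = zoom(mask, (spacing_z / 1.0, 1.0, 1.0), order=0)
    # 2) Downsample each 512x512 slice to 256x256.
    volume = zoom(volume, (1.0, 0.5, 0.5), order=1)
    mask = zoom(mask, (1.0, 0.5, 0.5), order=0)
    # 3) Crop to the liver extent plus a 20-slice margin on each side.
    idx = np.where(mask.any(axis=(1, 2)))[0]
    lo = max(idx[0] - margin, 0)
    hi = min(idx[-1] + margin + 1, volume.shape[0])
    volume = volume[lo:hi]
    # 4) Clip Hounsfield units to [-200, 200] and normalize to [0, 1].
    volume = np.clip(volume, -200, 200)
    return (volume + 200) / 400.0
```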
3.3 Ablations
In this section, we conduct ablation experiments on the three public datasets to verify the effectiveness of the proposed module combination. Four models were evaluated: 3D U-Net (Baseline), 3D ResU-Net (+ Res), 3D AI-UNet (+ Res + AI), and 3D DA-UNet (+ Res + AI + DS). As Table 2 shows, segmentation performance improves incrementally as the network modules are stacked.
Table 2
Quantitative analysis results of the ablation experiments on the three datasets
| Dataset | Model | Dice (%) | VOE (%) | RVD (%) | ASD (mm) | RMSD (mm) |
| --- | --- | --- | --- | --- | --- | --- |
| LiTS17 | 3D U-Net | 91.96 ± 0.65 | 7.76 ± 1.20 | 0.87 ± 0.28 | 1.58 ± 0.57 | 5.26 ± 4.59 |
| | +CRF | 92.63 ± 1.44 | 6.49 ± 2.63 | 0.68 ± 0.31 | 1.62 ± 1.44 | 5.09 ± 4.73 |
| | 3D ResU-Net | 94.62 ± 0.50 | 7.54 ± 0.93 | 0.68 ± 0.26 | 1.34 ± 0.64 | 4.94 ± 5.51 |
| | +CRF | 95.15 ± 0.59 | 5.54 ± 1.11 | 0.57 ± 0.16 | 1.49 ± 0.92 | 4.25 ± 5.47 |
| | 3D AI-UNet | 95.01 ± 0.53 | 7.23 ± 0.89 | 0.61 ± 0.29 | 1.36 ± 0.71 | 4.88 ± 5.12 |
| | +CRF | 95.85 ± 0.49 | 6.51 ± 0.71 | 0.55 ± 0.21 | 1.30 ± 0.85 | 4.51 ± 3.25 |
| | 3D DA-UNet | 96.71 ± 0.45 | 6.37 ± 0.84 | 0.53 ± 0.19 | 1.22 ± 0.25 | 4.54 ± 5.00 |
| | +CRF | 97.62 ± 0.27 | 4.64 ± 0.51 | 0.42 ± 0.11 | 1.07 ± 0.49 | 2.39 ± 0.72 |
| 3DIRCADb | 3D U-Net | 92.97 ± 0.65 | 7.73 ± 1.26 | 0.40 ± 0.13 | 3.44 ± 1.83 | 8.68 ± 6.88 |
| | +CRF | 94.96 ± 0.59 | 7.59 ± 1.19 | 0.39 ± 0.15 | 2.65 ± 2.02 | 8.13 ± 7.83 |
| | 3D ResU-Net | 95.80 ± 0.59 | 7.08 ± 1.27 | 0.36 ± 0.23 | 1.45 ± 1.91 | 3.26 ± 1.04 |
| | +CRF | 97.10 ± 0.16 | 5.62 ± 1.16 | 0.31 ± 0.12 | 0.98 ± 1.31 | 2.68 ± 1.21 |
| | 3D AI-UNet | 95.91 ± 0.61 | 6.81 ± 1.32 | 0.32 ± 0.26 | 1.49 ± 1.65 | 3.13 ± 1.28 |
| | +CRF | 96.12 ± 0.25 | 5.12 ± 1.12 | 0.29 ± 0.32 | 1.32 ± 1.22 | 2.85 ± 1.23 |
| | 3D DA-UNet | 96.54 ± 0.66 | 6.69 ± 1.24 | 0.22 ± 0.47 | 1.34 ± 0.33 | 2.64 ± 0.59 |
| | +CRF | 98.17 ± 0.19 | 3.58 ± 0.38 | 0.18 ± 0.12 | 0.95 ± 1.31 | 2.57 ± 0.32 |
| SLiver07 | 3D U-Net | 94.24 ± 0.57 | 6.11 ± 1.31 | 0.73 ± 0.28 | 3.76 ± 3.10 | 9.93 ± 8.14 |
| | +CRF | 94.63 ± 0.52 | 5.10 ± 1.03 | 0.71 ± 0.24 | 3.36 ± 2.01 | 8.86 ± 6.77 |
| | 3D ResU-Net | 96.43 ± 0.38 | 4.29 ± 1.03 | 0.48 ± 0.26 | 2.02 ± 1.53 | 6.87 ± 5.83 |
| | +CRF | 97.14 ± 0.39 | 3.51 ± 0.85 | 0.36 ± 0.19 | 1.70 ± 0.88 | 5.56 ± 4.58 |
| | 3D AI-UNet | 97.12 ± 0.42 | 4.31 ± 0.99 | 0.42 ± 0.25 | 1.58 ± 1.23 | 5.58 ± 4.65 |
| | +CRF | 97.51 ± 0.39 | 3.65 ± 0.78 | 0.39 ± 0.28 | 1.38 ± 0.95 | 4.12 ± 3.21 |
| | 3D DA-UNet | 97.84 ± 0.33 | 4.23 ± 0.63 | 0.21 ± 0.35 | 1.09 ± 0.09 | 4.77 ± 5.06 |
| | +CRF | 98.68 ± 0.36 | 2.61 ± 0.51 | 0.19 ± 0.14 | 1.07 ± 0.06 | 3.40 ± 4.17 |
With the residual structure added, 3D ResU-Net achieved superior performance over 3D U-Net on the main metrics, demonstrating the contribution of the residual structure to the performance improvement. Superposing the AI module further improved the segmentation accuracy of 3D AI-UNet. Finally, integrating the DS further enhanced the performance of 3D DA-UNet, proving the effectiveness of the DS.
Moreover, to validate the effectiveness of 3D dense CRF post-processing [18], we applied the 3D dense CRF on top of the above ablations. As Table 2 shows, the performance of every model improves after employing the 3D dense CRF.
Figure 3 shows typical visual results of the ablations on the three datasets. In Fig. 3(a) and (c), 3D U-Net shows severe over-segmentation errors. With 3D ResU-Net, the segmentation improves significantly, mainly because the residual structure makes the network deeper and wider and thus extracts more image features. After the AI module is integrated into 3D ResU-Net, the fuzzy liver boundary is refined; accuracy improves because the AI module extracts more image features at different scales. Finally, when the DS is employed, thanks to the improved top-layer output, 3D DA-UNet further increases segmentation accuracy, refining the details of some small areas. Figure 3(b) shows a typical liver adjacent to other organs, for which 3D U-Net exhibits a severe under-segmentation error. As the residual structure, the AI module, and the DS are successively integrated, the under-segmentation error is reduced continuously. Moreover, when the 3D dense CRF is employed, the segmentation errors are further reduced, and the result is the best in all cases.
3.4 Time-costs
Table 3 reports the training and test times of the compared methods on the three datasets. Compared with 3D U-Net, 3D ResU-Net, and 3D AI-UNet, the proposed DA-UNet is deeper and wider, which inevitably increases the time cost of training and testing: its average training and test times without/with post-processing are slightly higher than those of the other three models. We also found that the test time increases significantly after employing the CRF. Nevertheless, trading a certain time cost for the best segmentation performance is acceptable.
Table 3
Training and testing time-costs of various methods on three different datasets
| Datasets | Method | Training time | Test time |
| --- | --- | --- | --- |
| LiTS17 | 3D U-Net | 57h 42min 17s | 42.78s |
| | 3D ResU-Net | 59h 13min 54s | 48.46s |
| | 3D AI-UNet | 60h 23min 45s | 49.32s |
| | 3D DA-UNet | 61h 28min 12s | 51.76s |
| | 3D DA-UNet + CRF | 61h 28min 12s | 6min 14s |
| 3DIRCADb | 3D U-Net | 31h 59min 35s | 13.75s |
| | 3D ResU-Net | 32h 02min 30s | 14.23s |
| | 3D AI-UNet | 33h 34min 29s | 14.67s |
| | 3D DA-UNet | 34h 01min 50s | 15.19s |
| | 3D DA-UNet + CRF | 34h 01min 50s | 2min 15s |
| SLiver07 | 3D U-Net | 26h 09min 09s | 27.93s |
| | 3D ResU-Net | 26h 24min 17s | 28.18s |
| | 3D AI-UNet | 26h 57min 49s | 28.65s |
| | 3D DA-UNet | 27h 59min 37s | 29.02s |
| | 3D DA-UNet + CRF | 27h 59min 37s | 3min 25s |
3.5 Comparisons
Table 4 compares the proposed method on the 3DIRCADb test set with deep learning-based SOTA methods. Our results are superior to the listed 2D-based methods on all five metrics, but slightly inferior on Dice and RVD to the 3D H-DenseUNet proposed by Li et al. [11].
Table 4
Comparisons with other SOTA methods on 3DIRCADb test datasets
| Reference | Method | Dice (%) | VOE (%) | RVD (%) | ASD (mm) | RMSD (mm) |
| --- | --- | --- | --- | --- | --- | --- |
| Christ [3] | 2D Cascaded FCN | 94.30 | 10.70 | -1.40 | 1.50 | 24.00 |
| Chlebus [25] | 2D U-Net | 92.30 ± 0.03 | 14.21 ± 5.71 | -0.05 ± 0.10 | 4.33 ± 3.39 | 8.35 ± 7.54 |
| Han [6] | 2D ResNet | 93.80 ± 0.02 | 11.65 ± 4.06 | -0.03 ± 0.06 | 3.91 ± 3.95 | 8.11 ± 9.68 |
| Seo [13] | 2D mU-Net | 96.01 ± 1.08 | 9.73 ± 2.91 | 0.38 ± 0.12 | 3.11 ± 0.84 | 9.20 ± 3.43 |
| Li [11] | 3D H-DenseUNet | 98.20 ± 0.01 | 9.36 ± 3.34 | 0.01 ± 0.02 | 1.28 ± 2.02 | 3.58 ± 6.58 |
| Proposed | 3D DA-UNet + CRF | 98.17 ± 0.19 | 3.58 ± 0.38 | 0.18 ± 0.12 | 0.95 ± 1.31 | 2.57 ± 0.32 |
On the one hand, these results stem from the use of 3D convolution, which effectively exploits the spatial information of adjacent slices; on the other hand, they benefit from the DS mechanism, which establishes both short-circuit and dense connections between the front and back layers of the network and achieves feature reuse. Besides, the DS effectively alleviates the gradient explosion/vanishing problem in model training and makes the updates of the hidden-layer filters focus more on high-resolution object features. Thus, to some extent, this shows that the DS makes the model pay more attention to the target region.
Moreover, a network generally outputs the prediction with the maximum probability, with no guarantee that each prediction is correct. The CRF, by contrast, has a transfer property that takes the arrangement of output labels into account: the CRF layer adds constraints, learned automatically by the layer itself, to the final predicted labels to ensure that they are plausible.
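As an illustration, the sketch below refines a 3D probability map with the third-party pydensecrf package; this is our own hedged example rather than the exact implementation of [18], and the pairwise standard deviations and compatibility value are illustrative.

```python
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax, create_pairwise_gaussian

def crf_refine(probs, n_iters=5):
    """probs: softmax output of shape (n_labels, D, H, W), e.g. background/liver."""
    n_labels, shape = probs.shape[0], probs.shape[1:]
    crf = dcrf.DenseCRF(int(np.prod(shape)), n_labels)
    crf.setUnaryEnergy(unary_from_softmax(probs))   # -log(prob) unary terms
    # Gaussian pairwise term over the 3D voxel grid encourages label smoothness.
    feats = create_pairwise_gaussian(sdims=(3, 3, 3), shape=shape)
    crf.addPairwiseEnergy(feats, compat=3)
    q = np.array(crf.inference(n_iters)).reshape(n_labels, *shape)
    return q.argmax(axis=0)                         # refined label map
```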
3.6 Challenges
To evaluate the performance of the proposed method, we participated in the MICCAI LiTS17 challenge and compared our approach with other published deep learning-based methods. Table 5 lists the results of the top-ranked SOTA methods (our team name: HUSTWH402; the results are publicly available at https://competitions.codalab.org/competitions/17094#results).
Table 5
Comparisons on LiTS17 challenge
| Method | Dimension | Model | DPC (%) | DG (%) |
| --- | --- | --- | --- | --- |
| Roth et al. (2019) [10] | 2D | U-Net | 95.0 | 94.0 |
| Kaluva et al. (2018) [7] | 2D | FCN | 91.2 | 92.3 |
| Liu et al. (2019) [22] | 2D | GIU-Net | - | 95.05 |
| Song et al. (2020) [24] | 2D | BS U-Net | 96.1 | 96.4 |
| Li et al. (2018) [11] | 3D | H-DenseUNet | 96.1 | 96.5 |
| Jin et al. (2018) [12] | 3D | RA U-Net | 96.3 | 96.1 |
| Rafiei et al. (2018) [23] | 3D | U-Net | - | 92.8 |
| Ours | 3D | DA-UNet | 95.3 | 95.8 |
As Table 5 shows, our proposed method obtained 95.3% for Dice per case (DPC) and 95.8% for Dice global (DG), ranking 16th and 13th, respectively. Although our result surpasses most 2D-based segmentation approaches, it is slightly lower than H-DenseUNet and RA U-Net. The reason is that these two 3D-based methods both employ 2D pre-training before the formal 3D network: Li et al. [11] first use a deep 2D DenseUNet for intra-slice feature extraction and then 3D H-DenseUNet for hybrid feature exploration; similarly, Jin et al. [12] use 2D input for liver localization (RA-UNet-I) and then 3D input for liver segmentation (RA-UNet-II). Therefore, although their segmentation accuracy is higher, the end-to-end framework is compromised to a certain extent.
3.7 Advantages
This section illustrates some challenging cases handled by the proposed method. Figure 4(a-b) shows livers with fuzzy boundaries; the blurred edges connecting the liver region are segmented with only slight error. Figure 4(c-d) shows a discontinuous liver adjacent to other organs, where the model exhibits a slight over-segmentation error. Figure 4(e) shows a liver with blood vessels inside; there is a slight error around the vessel regions. However, after 3D dense CRF post-processing, the segmentation results are close to the ground truth.
The proposed model demonstrates superiority in challenging cases such as large and small liver regions, liver discontinuities, and livers containing blood vessels. The main reasons are as follows. First, we upgrade the 2D convolutions of U-Net to 3D convolutions, making full use of the information between slices. Second, by adding residual connections to each convolution block, the network extracts more complex correlated features (a sketch of such a block is given after this paragraph). Third, the DS mechanism introduced into the decoding path makes the network focus on relevant shallow-layer features, enabling the top layer to output better discrimination and higher accuracy. Fourth, with the Tversky loss function, adjusting the parameters α and β lets the model trade off FP against FN, effectively avoiding over-/under-segmentation. Finally, the 3D dense CRF is used as post-processing to further optimize fine boundaries.
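As a concrete illustration of the second point, a minimal 3D residual convolution block in PyTorch is sketched below; the channel counts, kernel size, and normalization are our assumptions, not the paper's exact configuration.

```python
import torch
from torch import nn

class ResBlock3D(nn.Module):
    """Two 3D convolutions with an identity (or projected) skip connection."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm3d(out_ch), nn.ReLU(inplace=True),
            nn.Conv3d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm3d(out_ch))
        # 1x1x1 projection so the skip path matches the output channel count.
        self.skip = (nn.Identity() if in_ch == out_ch
                     else nn.Conv3d(in_ch, out_ch, kernel_size=1))
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.body(x) + self.skip(x))
```

The identity path keeps gradients flowing through deep stacks of such blocks, the property to which the ablation attributes the gains of 3D ResU-Net.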
3.8 Limitations
We illustrate some limitations on liver cases with low-contrast neighboring organs. When pathological liver tumors lie at the boundary, our proposed method may produce significant over-/under-segmentation errors (Fig. 5). Thus, while the proposed model achieves superior results when the liver has low contrast with neighboring organs, it remains prone to errors when part of the liver border contains tumors.