Data resource
The dataset used in this study is BraTS2018, which contains 285 cases in the training set and 66 cases in the validation set.
Model architecture
This study develops 2D and 3D segmentation models based on an improved U-Net3+ network with stage residuals, as shown in Fig. 1 and Fig. 6, respectively. The main contributions of this study are as follows:
(i) In the encoder part, an encoder based on the stage residual structure is proposed. This structure alleviates the degradation problem caused by increasing network depth, improves the feature extraction ability of U-Net3+ during down-sampling, and provides richer semantic information for up-sampling.
(ii) The BN (Batch Normalization) [13] layers are replaced with FRN (Filter Response Normalization) [11], eliminating the dependence on batch size; FRN can even surpass BN when the batch size is large. The network also uses TLU, an improved version of the ReLU activation function with a learnable threshold.
(iii) Based on the stage-residual U-Net3+ 2D model, we construct the IResUnet3+ 3D model and process the 3D data in blocks to achieve 3D network segmentation. The proposed model achieves a segmentation effect similar to that of the 40M-parameter 3D V-Net model at a fraction of the parameter count.
The experimental results show that the proposed network improves the segmentation accuracy of small areas and produces smoother, more accurate tumor-edge segmentation.
Data Preprocessing
Fig.2 shows the preprocessing flowchart of this study.
The BraTS2018 dataset used in this experiment provides four modalities: T1-weighted images, T2-weighted images, fluid-attenuated inversion recovery (FLAIR), and contrast-enhanced T1-weighted images (T1C). Since the modalities differ in contrast, each modality is standardized separately. The corresponding ground truth has three labels: the edema area (ED), the enhancing tumor area (ET), and the non-enhancing tumor (NET). These labels are combined into three nested segmentation regions: whole tumor (WT), tumor core (TC), and enhancing tumor (ET). The channels of the four modalities and the three segmentation regions are then merged, the redundant background is cropped out, and patches are extracted to fit the 2D segmentation network. Finally, the results are saved as .npy files.
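As an illustration, the standardization and cropping steps can be sketched in NumPy (a minimal sketch only; the crop window, the brain-mask heuristic, and the function names are our own assumptions, not the paper's exact pipeline):

```python
import numpy as np

def zscore(modality):
    """Standardize one modality over its non-zero (brain) region."""
    brain = modality[modality > 0]
    return (modality - brain.mean()) / (brain.std() + 1e-8)

def preprocess_case(modalities):
    """modalities: (4, 240, 240) stack of T1, T2, FLAIR, T1C for one slice.
    Standardize each modality, merge channels, and crop the background
    (the 160x160 window here is an assumed example)."""
    x = np.stack([zscore(m) for m in modalities])   # (4, 240, 240)
    x = x[:, 40:200, 40:200]                        # crop -> (4, 160, 160)
    return x
```

Each case is then saved with `np.save` as a .npy file for training.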
Encoder Based on Stage Residual
To solve the network degradation problem, researchers often use the residual structure proposed by He et al. [16] to train deep networks. However, this structure causes other problems: the number of ReLUs on the main path of the residual structure grows with the network's depth, and negative activations are cleared by each ReLU, so the information flow is strongly affected during propagation. To address this, He et al. [17] proposed the pre-activation structure, whose principle is to move the ReLU off the main path. Although this solves the problem above, it creates a new one: it is precisely the non-linearity of the activation function that lets the network learn non-linear relationships in the data, so removing the non-linear activation from the residual structure leaves no non-linearity between different residual blocks and makes the network harder to learn. In addition, the main path of both the standard and the pre-activated residual structures is not normalized; thus, the complete (added) signal is never fully normalized, which increases the difficulty of network convergence.
Based on this observation, Ionut et al. [18] proposed the stage residual structure. As shown in Fig. 3, the principle is to divide the network into stages, each consisting of a start residual block, several middle residual blocks (any number, including zero), and an end residual block. Thus, no matter how the network depth changes, the number of ReLUs on the main path stays fixed as long as the number of stages is fixed. This reduces the harmful effects of ReLU on the signal as it passes through many layers while still retaining ReLU's non-linear benefits. After the end residual block, the entire signal is normalized, accelerating network convergence.
Based on these structural advantages, this study combines the stage residual with the encoder to improve the feature extraction ability during down-sampling. The improved encoder consists of a start residual block, several middle residual blocks, and an end residual block; the number of middle residual blocks is set to 0 so that the number of 3×3 convolutions matches that of the benchmark network.
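A minimal PyTorch sketch of one such encoder stage follows (the layer layout inside each block is an assumption based on the description above; plain `BatchNorm2d` stands in for the FRN/TLU layers introduced later, and `StartBlock`, `EndBlock`, and `Stage` are illustrative names):

```python
import torch
import torch.nn as nn

class StartBlock(nn.Module):
    """Opens a stage: the residual branch is conv-norm-relu-conv-norm;
    the identity path carries the signal through with no ReLU on it."""
    def __init__(self, ch):
        super().__init__()
        self.branch = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch))
    def forward(self, x):
        return x + self.branch(x)

class EndBlock(nn.Module):
    """Closes a stage: after the addition the WHOLE signal is normalized
    and passed through a single ReLU -- the only ReLU on the main path,
    however many blocks the stage contains."""
    def __init__(self, ch):
        super().__init__()
        self.branch = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch))
        self.post = nn.Sequential(nn.BatchNorm2d(ch), nn.ReLU(inplace=True))
    def forward(self, x):
        return self.post(x + self.branch(x))

class Stage(nn.Module):
    """One encoder stage: start block + n middle blocks + end block.
    n_middle = 0 in this paper, matching the benchmark's 3x3 conv count
    (middle blocks reuse the start-block layout in this sketch)."""
    def __init__(self, ch, n_middle=0):
        super().__init__()
        mids = [StartBlock(ch) for _ in range(n_middle)]
        self.blocks = nn.Sequential(StartBlock(ch), *mids, EndBlock(ch))
    def forward(self, x):
        return self.blocks(x)
```

Note how adding middle blocks deepens the stage without adding any ReLU to the main path.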
Full-scale Skip Connection
Besides the encoder, skip connections have also received much attention; for example, U-Net++ [6] designed an architecture with nested and dense skip connections on top of U-Net. However, Huang et al. [7] argued that U-Net++ still does not exploit enough information from multiple scales and proposed U-Net3+, which uses full-scale skip connections to combine high-level and low-level semantics from different scales and provide richer information for up-sampling.
Fig. 4 explains how the feature map of a decoder node is constructed. As in U-Net, the node directly receives the feature map from the encoder layer at the same scale, but unlike U-Net there is more than one skip connection. The two upper skip connections down-sample the shallower (higher-resolution) encoder layers with max-pooling operations of different sizes to transmit low-level semantic information; this pooling unifies the resolution of the feature maps. As the figure shows, the shallower of these two layers must reduce its resolution by a factor of four and the other by a factor of two. The two lower skip connections use bilinear interpolation to up-sample the deeper decoder layers, enlarging their feature maps by factors of four and two, respectively. After the sizes are unified, the numbers of channels are unified as well: each feature map passes through a 3×3 convolution with 64 channels, all maps are concatenated along the channel dimension, and feature fusion is performed, producing a new feature map with 320 channels.
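The aggregation at one such decoder node can be sketched as follows (a NumPy stand-in: nearest-neighbour upsampling replaces bilinear interpolation, a random 1×1 channel projection replaces the trained 3×3, 64-channel convolution, and all function names are our own):

```python
import numpy as np

def max_pool(x, k):
    """(C, H, W) max-pooling with kernel and stride k."""
    C, H, W = x.shape
    return x.reshape(C, H // k, k, W // k, k).max(axis=(2, 4))

def upsample(x, k):
    """Nearest-neighbour upsampling (stand-in for bilinear)."""
    return x.repeat(k, axis=1).repeat(k, axis=2)

def to64(x, rng):
    """Project any channel count to 64 (stand-in for the 64-ch conv)."""
    w = rng.standard_normal((64, x.shape[0]))
    return np.einsum('oc,chw->ohw', w, x)

def fuse(en_shallow2, en_shallow1, en_same, de_deep1, de_deep2, rng):
    """Decoder node at resolution H x W: pool the two shallower encoder
    maps (4x and 2x), take the same-scale encoder map directly, upsample
    the two deeper decoder maps (2x and 4x), project each of the five
    maps to 64 channels, and concatenate -> 5 * 64 = 320 channels."""
    parts = [max_pool(en_shallow2, 4), max_pool(en_shallow1, 2), en_same,
             upsample(de_deep1, 2), upsample(de_deep2, 4)]
    return np.concatenate([to64(p, rng) for p in parts], axis=0)
```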
FRN
Experimental comparison shows that U-Net, U-Net++, and U-Net3+ all use batch normalization [13] to normalize the data passing through the convolution layers, which makes the whole network dependent on the batch size N: when N is small, performance degrades sharply. Although group normalization [12], proposed by Wu and He, is unaffected by the batch size, it has not been widely adopted and struggles to compete with BN when the batch size is large. FRN [11] removes the dependence on batch size and can surpass BN even when the batch size is large.
Fig. 5 shows the calculation process of FRN. The input X is a single feature map of size (H, W); the computation therefore does not involve the batch size N at all. The procedure differs slightly from other normalization layers [15,17]: it omits subtracting the mean and replaces the variance with ν², the mean of the squared values of X. As usual, scaling and shifting are applied after normalization, and a small constant ε in the denominator prevents division by zero. Because FRN performs no mean subtraction, the normalized result may drift far from zero; if it is followed by a plain ReLU, many activations may be zeroed out uniformly, which is detrimental to model training and performance. To solve this problem, a thresholded ReLU (TLU) is used to eliminate this bias, as shown in formula (1):

z = max(y, τ)    (1)
The threshold τ is a learnable parameter. Saurabh et al. [11] found that TLU is very important for performance after FRN normalization.
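A minimal NumPy sketch of the FRN + TLU computation, following the description above (the default parameter values are illustrative; in the network γ, β, and τ are learned per channel):

```python
import numpy as np

def frn_tlu(x, gamma=1.0, beta=0.0, tau=0.0, eps=1e-6):
    """Filter Response Normalization followed by TLU.
    x: (N, C, H, W). nu2 is computed per sample and per channel over
    H and W only, so the statistics never depend on the batch size N."""
    nu2 = np.mean(x ** 2, axis=(2, 3), keepdims=True)  # mean squared value
    x_hat = x / np.sqrt(nu2 + eps)                     # no mean subtraction
    y = gamma * x_hat + beta                           # scale and shift
    return np.maximum(y, tau)                          # TLU: z = max(y, tau)
```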
Loss Function
In medical image segmentation, data imbalance is a very common problem: in most datasets the number of lesion voxels is far lower than the number of non-lesion voxels, and brain tumor datasets are no exception, as the tumor occupies a much smaller area than the brain. To address this, Fausto et al. [15] proposed a loss function based on the Dice coefficient, which significantly alleviates the imbalance and lets the network learn effectively. For small targets, however, Dice loss fluctuates violently once a target is missed. This study therefore uses a mixed loss that combines cross-entropy loss and generalized Dice loss with corresponding weights, as shown in formula (2):

L = α · L_CE + β · L_GD    (2)
The loss-function weights are set to α = 0.5 and β = 1.0.
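A NumPy sketch of this mixed loss (the per-class 1/(Σg)² weighting is the common formulation of generalized Dice loss and is our assumption about the exact variant used; shapes and names are illustrative):

```python
import numpy as np

def mixed_loss(pred, target, alpha=0.5, beta=1.0, eps=1e-6):
    """pred: (C, N) softmax probabilities, target: (C, N) one-hot labels.
    L = alpha * cross-entropy + beta * generalized Dice loss."""
    # Cross-entropy over all N pixels.
    ce = -np.mean(np.sum(target * np.log(pred + eps), axis=0))
    # Generalized Dice: small classes get large weights 1 / (sum g)^2.
    w = 1.0 / (np.sum(target, axis=1) ** 2 + eps)
    inter = np.sum(w * np.sum(pred * target, axis=1))
    union = np.sum(w * np.sum(pred + target, axis=1))
    gdl = 1.0 - 2.0 * inter / (union + eps)
    return alpha * ce + beta * gdl
```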
3D Model Based on IResUnet3+
In the preceding sections, a 2D neural network is used to segment brain tumor magnetic resonance imaging (MRI). Although the proposed IResUnet3+ network improves significantly on the baselines, false alarms remain in the normal tissue around the tumor, i.e., many outliers are predicted in the surrounding areas. This is because the MRI sequence is inherently 3D data, but the 2D network's preprocessing slices it, so the patch data loses much spatial information and the network cannot learn enough. This study therefore develops the IResUnet3+ 3D model, based on the proposed IResUnet3+ 2D model, to examine 3D brain tumor segmentation. The 3D model's structure is the same as the 2D model's, except that 3D convolution replaces 2D convolution and FRN and TLU are adapted to 3D input data. The major difference from the 2D model is the data preprocessing, explained in detail in subsection 3.1. The IResUnet3+ 3D model diagram is shown below.
3D Data Preprocessing
Due to limited experimental resources, the complete 3D volume cannot be fed to the network directly; to achieve 3D network segmentation, the 3D data is divided into blocks. Unlike the 2D network's patches, the block data is still 3D. The preprocessing consists of five steps. First, five blank slices are added manually to satisfy the block-partitioning requirements: three blank slices are prepended and two appended to the four modality images (155, 240, 240) and the corresponding mask (155, 240, 240), making all of them (160, 240, 240). After normalization and cropping, block partitioning is performed. Fig. 7 shows one partitioning scheme, method A: the cropped image and label are both (160, 160, 160), the block size is (32, 160, 160), and the moving step is 32, i.e., five blocks of size (32, 160, 160) are cut along the Z-axis.
Data preprocessing plays a decisive role in model training; poor preprocessing can lead to insufficient training or even failure, and for 3D networks the block-partitioning method deserves particular attention. Experiments show that, although partitioning method A (Fig. 7) is simple, the blocks lack correlation with one another, so during training the network cannot fully learn the structural relationships between blocks. A block here is analogous to a slice in the 2D network: it contains more 3D structural information than a slice, but the connection between blocks is still missing, so the network cannot learn how the structures relate. We therefore explored another partitioning scheme, shown in Fig. 8 and called method B for distinction. Method B is equally simple: the block size is unchanged, but the moving step is set to 8, i.e., a block of size (32, 160, 160) is taken every eight slices along the Z-axis.
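The padding step and the two partitioning methods can be sketched as follows (a minimal NumPy illustration; function names are our own):

```python
import numpy as np

def pad_z(volume):
    """Pad a (155, H, W) volume to (160, H, W): three blank slices in
    front, two at the back."""
    return np.pad(volume, ((3, 2), (0, 0), (0, 0)))

def make_blocks(volume, depth=32, step=32):
    """Slide a (depth, 160, 160) window along the Z-axis of a
    (160, 160, 160) volume. step=32 gives method A (disjoint blocks),
    step=8 gives method B (overlapping blocks)."""
    Z = volume.shape[0]
    return [volume[z:z + depth] for z in range(0, Z - depth + 1, step)]
```

With step 8, consecutive blocks share 24 slices, which is what lets the network learn the relationships between neighbouring blocks.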
To compare the two partitioning methods, the data is processed with methods A and B, and a V-Net network is trained on each under the same experimental conditions. Early stopping is used to supervise the training: when the accuracy on the validation set does not improve for a certain number of epochs, training ends. Fig. 9 compares the training processes on the data obtained with the two methods. The data from method A lacks the inter-block connection information, so the model cannot be fully trained; when early stopping is triggered, the model loss remains high. The data from method B lets the network fully learn the 3D structural information of the whole dataset, which benefits both convergence and accuracy.
In summary, data preprocessing plays a pivotal role in model training. Compared with method A, partitioning method B keeps the blocks related to each other, so more structural information can be captured during training, which benefits network learning.
Experiment and Analysis
Experimental Environment
The operating environment is Win10 with an Intel Core i7-8700 @ 3.20 GHz six-core CPU, 32 GB of memory, an Nvidia GeForce GTX 1080Ti graphics card (11 GB, Gigabyte), PyTorch 1.4.0, and Python 3.6. The Adam optimizer is used for gradient descent with a learning rate of 0.03 and a batch size of 2.
Analysis of feature extraction ability
In medical image segmentation, the goal is a binary image containing only the lesion (the lesion locations are positive and everything else is 0). The neural network model should therefore identify and highlight the lesion while suppressing non-lesion regions, and the model's feature extraction ability is reflected in its perception of the lesion location. As mentioned in 2.2, the improved encoder based on the stage residual significantly improves feature extraction. We therefore visualize the outputs of the proposed model's encoder layers and compare them with U-Net's. As shown in Fig. 10, U-Net perceives the lesion in the input image poorly: its attention is scattered across the whole image rather than concentrated on the lesion. The proposed model, in contrast, is very sensitive to the lesion, identifying and highlighting it while suppressing non-lesion areas. This indicates that adding the stage residual structure improves the encoder's feature extraction ability.
Comparative Analysis of Different Methods
On the same dataset, the proposed model is compared with existing mainstream models, and 2D and 3D versions are constructed to examine the performance difference between 2D and 3D models on brain tumor segmentation. The mainstream medical image segmentation models used for comparison are U-Net, U-Net++, U-Net3+, and ResUnet; the experimental results are shown in Fig. 11, Fig. 12, and Fig. 13.
Comparative Analysis of 2D and 3D Models
Comparing the 2D and 3D IResUnet3+ models on brain tumor segmentation, the 2D model produces large misjudged and spurious regions when predicting on 3D brain tumor data. This is because the 2D model receives one slice at a time and cannot learn the relationships between slices, whereas the input to the 3D network is a 3D block that itself contains 3D structural information. Moreover, with partitioning method B the network also captures the connections between blocks, which further helps it learn the 3D structure of the tumor lesion, improving segmentation accuracy and reducing the misjudgment rate.
Comparative Analysis of U-Net and its Variant Models
Comparing the segmentation of brain tumors by U-Net, U-Net++, and U-Net3+, we observe that the classical U-Net, with its encoder-decoder structure and skip connections that merge low-level and high-level features, performs basic segmentation of tumor lesions reasonably well, but still suffers from misjudgments, spurious detections, and low accuracy. U-Net++ builds a nested, densely skip-connected architecture on U-Net: four U-Nets of different depths are spliced together through multiple skip connections, which helps fully fuse features at the same scale. However, it does not fuse features across scales and may suffer from feature redundancy. U-Net3+ addresses this by proposing full-scale skip connections while retaining U-Net's simple one-encoder-one-decoder architecture: features from different scales are merged through skip connections without redundancy, so the feature information of all scales is integrated. Consequently, U-Net3+ achieves better segmentation results.
Comparing the segmentation of brain tumors by U-Net3+, FRN_U-Net3+, and IResUnet3+, we observe that, as described above, U-Net3+ can segment brain tumors thanks to its full-scale skip connections, but it still has room for improvement. First, the BN normalization used in U-Net3+ ties the network to the batch size; when the batch size is small, performance tends to be poor. We therefore replaced BN with the FRN normalization layer to eliminate the influence of batch size, and under the same batch size FRN_U-Net3+ performs significantly better than U-Net3+, confirming that removing this dependence is essential. Second, the traditional Conv-BN-ReLU blocks in the U-Net3+ encoder show weak feature extraction ability. The improved encoder based on the stage residual strengthens feature extraction, helping the network learn more feature information and enabling better feature fusion during up-sampling.
Finally, all 2D and 3D model segmentation results are shown in Fig.14 and Fig. 15, respectively.
Statistical Analysis of Segmentation Results
We evaluated all models on the validation dataset provided by the BraTS2018 challenge. Table 1 shows the segmentation results, and box plots of all experimental models are displayed in Fig. 16. Note that all metrics are calculated through the BraTS2018 online evaluation platform. Two commonly used medical image segmentation metrics are used to evaluate the results: the Dice coefficient (Dice) and Sensitivity (SEN), as shown in formula (3):

Dice = 2TP / (2TP + FP + FN),    SEN = TP / (TP + FN)    (3)

Here TP is the number of correctly segmented foreground pixels, FP is the number of background pixels incorrectly segmented as foreground, and FN is the number of foreground pixels incorrectly segmented as background. Dice measures the similarity between the prediction and the label: the larger the Dice, the higher the similarity. SEN indicates the probability that lesion pixels are correctly segmented.
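For binary masks, the two metrics can be computed as follows (a minimal sketch; function names are our own):

```python
import numpy as np

def counts(pred, gt):
    """Pixel-level TP / FP / FN for binary prediction and ground truth."""
    tp = np.sum((pred == 1) & (gt == 1))
    fp = np.sum((pred == 1) & (gt == 0))
    fn = np.sum((pred == 0) & (gt == 1))
    return tp, fp, fn

def dice(pred, gt):
    """Dice = 2TP / (2TP + FP + FN): prediction-label overlap."""
    tp, fp, fn = counts(pred, gt)
    return 2 * tp / (2 * tp + fp + fn)

def sensitivity(pred, gt):
    """SEN = TP / (TP + FN): fraction of lesion pixels found."""
    tp, _, fn = counts(pred, gt)
    return tp / (tp + fn)
```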
Table 1 Segmentation effect of each model
Model Type     | Params | ET Dice | WT Dice | TC Dice | SEN_ET | SEN_WT | SEN_TC
---------------|--------|---------|---------|---------|--------|--------|-------
2DUnet         | 39M    | 72.34   | 86.22   | 73.77   | 77.80  | 85.81  | 71.47
2DUnet++       | 36M    | 72.39   | 85.60   | 73.36   | 76.20  | 85.78  | 71.81
2DUnet3+       | 27M    | 73.93   | 87.23   | 77.28   | 74.94  | 88.26  | 77.97
3DVnet         | 40M    | 76.25   | 88.87   | 78.72   | 80.00  | 91.30  | 83.37
3DUnet         | 4.1M   | 67.12   | 87.37   | 73.52   | 65.35  | 89.01  | 80.62
3DUnet++       | 6.8M   | 67.12   | 85.81   | 67.66   | 63.91  | 88.38  | 75.99
3DResUnet      | 4.2M   | 72.60   | 87.96   | 71.24   | 73.70  | 90.11  | 73.43
3DUnet3+       | 5M     | 72.41   | 86.89   | 73.53   | 72.74  | 90.94  | 76.60
3D_FRN_Unet3+  | 5M     | 72.22   | 87.74   | 78.59   | 81.18  | 92.28  | 81.20
Ours           | 6.6M   | 75.65   | 88.77   | 78.62   | 79.12  | 91.51  | 78.96
Several observations follow from the table. First, comparing the 2D and 3D models: 3D segmentation models suffer from high computational cost and large memory consumption, so the model scale must be reduced to cut the parameters (here the convolution channel numbers per layer are [16, 32, 64, 128, 256] for the 3D models versus [64, 128, 256, 512, 1024] for the 2D models). This reduction weakens the model's learning ability and worsens the final segmentation; the proposed model maintains good learning ability and improves the segmentation effect at the same compressed scale. Second, comparing 3DU-Net, 3DU-Net3+, 3D_FRN_Unet3+, and 3DIResUnet3+: the full-scale skip connection of U-Net3+ provides more information for up-sampling by combining high-level and low-level semantics from different scales, improving segmentation accuracy. Because U-Net3+ uses the common BN normalization layer, the network is limited by the batch size; replacing BN with FRN removes this limitation, lets the network train fully, and improves segmentation accuracy: the Dice coefficients of WT and TC increased by 0.85% and 5.06%, respectively, and the sensitivities of WT, TC, and ET increased by 1.34%, 4.6%, and 8.44%, respectively. The encoder improved with stage residuals solves the insufficient feature extraction of the U-Net3+ encoder at the cost of a small number of extra parameters and provides more semantic information for up-sampling, further improving segmentation accuracy: the Dice coefficients of ET and WT increased by a further 3.43% and 1.03%, respectively.
Comparing 3D IResUnet3+ with 3D V-Net, the proposed model achieves a segmentation effect similar to that of the 40M-parameter 3D V-Net with far fewer parameters. The IResUnet3+ model is thus lightweight and effective for brain tumor segmentation.