CMFCUNet: cascaded multi-scale feature calibration UNet for pancreas segmentation

Segmenting the pancreas from abdominal CT scans is challenging since it often occupies a relatively small region. Researchers have suggested leveraging coarse-to-fine approaches to cope with this challenge. However, the coarse-scaled segmentation and the fine-scaled segmentation are either trained separately, using the coordinates located by the coarse-scaled segmentation mask to crop the fine-scaled segmentation input, or trained jointly, using the coarse-scaled segmentation mask to enhance the fine-scaled segmentation input. We argue that these two solutions are complementary to some extent and can promote each other to improve pancreas segmentation performance. In addition, the backbone of the coarse-scaled and fine-scaled segmentation is mostly based on UNet or UNet-like networks, where the multi-scale features transmitted from the encoder to the decoder have not previously been explored for vertical calibration. In this paper, we propose a cascaded multi-scale feature calibration UNet (CMFCUNet) for pancreas segmentation, where the multi-scale features in the backbone of each scaled segmentation are calibrated vertically in a pixel-wise fashion, and the coarse-scaled and fine-scaled segmentations are connected by a designed dual enhancement module (DEM). Experiments are first conducted on the public NIH pancreas dataset: with CMFCUNet, performance increases by over 3% on the Jaccard index (JI) and nearly 1% on the Dice similarity coefficient (DSC), surpassing all existing pancreas segmentation approaches. Our experiments also demonstrate that CMFCUNet improves the coarse-to-fine segmentation framework and outperforms the mainstream coarse-to-fine pancreas segmentation approaches. Furthermore, we conducted ablation studies to analyze the effectiveness of the backbone (MFCUNet) and the DEM. In addition to the experiments on the NIH dataset, we also experimentally demonstrate the excellent generalization of our method on the MSD pancreas dataset.


Introduction
Accurate pancreas segmentation is a prerequisite for utilizing computers to assist doctors in clinical practice. However, it is not an easy task due to the pancreas's extremely small size and its irregular shape and boundary (Fig. 1). Traditional approaches such as graph-based segmentation [1], superpixel-based segmentation [2], morphology-based segmentation [3], and statistics-based segmentation [4] typically leverage simple models or rely on artificial inductive biases introduced through manual intervention to segment the pancreas. In addition, traditional approaches tend not to have enough capacity to adapt to pixel-level segmentation tasks, especially when the segmentation target varies greatly.

Fig. 1 An example from the NIH pancreas dataset. The red region indicates the pancreas, which has a small size and an irregular shape and boundary
Contrary to traditional approaches, deep learning-based methods such as FCN [5] and UNet [6] have shown a strong ability to fit complex nonlinear segmentation tasks and have achieved the best performance on many targets such as the liver [7][8][9], kidney [10,11], cardiomyopathy [12] and small tissues [13]. However, when conducting pancreas segmentation, deep learning-based networks are easily confused by the complex and variable background, as the pancreas often accounts for less than 0.5% of the network input [14]. Therefore, researchers proposed coarse-to-fine segmentation approaches [15][16][17], where the coarse-scaled segmentation roughly localizes the pancreas to suppress the complex background and the fine-scaled segmentation utilizes the localized coordinates to crop the coarse-scaled input and conduct a refined segmentation. Despite their effectiveness, these coarse-to-fine approaches train the coarse-scaled segmentation and fine-scaled segmentation separately, so the fine-scaled segmentation lacks the context information flow from the coarse-scaled segmentation during training. Different from separate training approaches, researchers [18,19] trained the coarse-scaled segmentation and the fine-scaled segmentation jointly, utilizing the coarse-scaled masks as weights to enhance the foreground of the fine-scaled input in an end-to-end fashion. Note that the context information in the coarse-to-fine approaches refers to the information of the coarse-scaled segmentation mask passed from the coarse-scaled segmentation network to the fine-scaled segmentation network during joint training. Although joint training solves the problem of the missing context information in the fine-scaled segmentation, irrelevant background regions are not removed. The separate training and joint training processes are shown in Fig. 2. We consider the above separate training and joint training coarse-to-fine approaches to be complementary. Therefore, we propose a novel cascaded segmentation framework that not only utilizes the coarse-scaled segmentation coordinates to crop the fine-scaled segmentation input and remove the complex background region, but also enables the fine-scaled segmentation to leverage the context information of the coarse-scaled segmentation through joint training.

Fig. 2 Two main types of coarse-to-fine approaches
The key to this novel cascaded framework is the proposed dual enhancement module (DEM): the coarse-scaled segmentation probability map is first adaptively transformed and multiplied with the coarse-scaled input, and the result is then cropped using the localization coordinates derived from the coarse-scaled segmentation. Finally, the cropped images are utilized as the input of the fine-scaled segmentation.
Moreover, the backbone networks in the coarse-scaled segmentation and fine-scaled segmentation are mainly based on UNet or UNet-like models, where the multi-scale features passed from the encoder to the decoder are significantly important as they can be applied to repair fine-grained boundary details [5]. Many approaches [20][21][22] have been exploited to fuse multi-scale features to improve segmentation performance. Among them, UNet3+ [21] is the most popular due to its full-scale feature fusion strategy. As illustrated in Fig. 3a, at each scale of the decoder, all lower-level feature maps in the encoder, which provide more multi-scale detailed information, are directly fused to the higher-level feature maps in the decoder through skip connections. We argue that this fusion strategy is not optimal because the lower-level feature maps in the encoder contain redundant pixel-wise detailed information that is not conducive to repairing the details of the segmentation object in the higher-level feature maps of the decoder. Therefore, we propose the multi-scale feature calibration UNet (MFCUNet). As shown in Fig. 3b, the multi-scale feature calibration gate (MFCG) at each scale queries the required feature information by interacting with the features of the previous layer, which not only suppresses the irrelevant redundant pixel-wise detailed information in the lower-level feature maps of the encoder but also calibrates the feature maps vertically. To our knowledge, we are the first to explore vertical calibration of multi-scale feature maps in a cross-layer manner to suppress irrelevant noise. Note that the decoder in UNet3+ also has direct connections of multi-scale features. Since we study how to transfer accurate features from the encoder to the decoder, we remove the direct connections in the decoder of UNet3+.
The framework combining DEM and MFCUNet is our proposed cascaded multi-scale feature calibration UNet (CMFCUNet), and we conducted quantitative and qualitative experiments on the NIH pancreas segmentation dataset [15] to evaluate it. First, we demonstrate that MFCG is beneficial for calibrating the multi-scale features: when leveraging MFCG at each scale, segmentation performance improves by more than 2% on DSC and JI. In addition, we show that using the DEM to connect the coarse-scaled segmentation and fine-scaled segmentation is superior to utilizing only one kind of connection. Furthermore, we compare CMFCUNet with four mainstream coarse-to-fine segmentation approaches, and the results show that our method outperforms existing coarse-to-fine pancreas segmentation methods. Finally, we compare CMFCUNet with state-of-the-art pancreas segmentation methods in terms of both effectiveness and efficiency, and the results show that our method performs better. In addition to the experiments on the NIH dataset, we also conducted comparative experiments on the MSD pancreas dataset against strong baselines, and the experimental results demonstrate the excellent generalization of the proposed method.
In summary, our main contributions are three-fold. First, we proposed a novel coarse-to-fine pancreas segmentation framework in which the dual enhancement module (DEM) not only crops the fine-scaled segmentation input to remove the background region but also effectively enhances the fine-scaled segmentation input with the context information of the coarse-scaled segmentation mask through joint training. Second, considering that the direct fusion of multi-scale features to recover boundary details is redundant and inaccurate, we proposed MFCUNet to vertically calibrate the features in a pixel-wise fashion. Third, we proposed the cascaded multi-scale feature calibration UNet (CMFCUNet), which combines the advantages of MFCUNet and DEM and achieves state-of-the-art pancreas segmentation performance.

Related work
Pancreas segmentation belongs to the field of medical image analysis and is the premise for further pancreas-related diagnosis [23]. However, it is not an easy task to segment the pancreas due to its large variability in size, shape and position. To alleviate these challenges, researchers have explored many approaches, which can be divided into two categories: traditional approaches and deep learning-based approaches.
Traditional pancreas segmentation methods are directly based on human intuition, relying on manual feature extraction or on human intervention in the segmentation process [1,2,24-31]. Among these methods, atlas-based algorithms are the most popular [1,24,25,29,30]. For example, Shimizu et al. [24] utilized an atlas to perform coarse-scaled segmentation and then combined morphological operations and ensemble classifiers to refine the segmentation masks. Similarly, Karasawa et al. [25] suggested applying atlas algorithms for the coarse-scaled segmentation and then leveraging graph cuts [32] for the fine-scaled segmentation. Different from the above atlas-based approaches, which employ atlas algorithms only for coarse-scaled segmentation, studies [1,29,30] applied the atlas algorithm to conduct the fine-scaled segmentation according to the vascular structure and surrounding tissues. In summary, we found that the key to improving the performance of atlas algorithms is that the generated atlases should preferably be patient-specific; however, current atlas-based methods cannot handle the spatial and shape variation of the pancreas well in a patient-specific manner. There are also a number of other pancreas segmentation approaches, such as the statistical model-based approach [26], the superpixel-based approach [2], the intensity-based approach [27], the level-set-based approach [28] and the graph-based approach [31]. These methods utilized statistical learning models, leveraged superpixel clustering, relied on pixel intensity distributions, employed curve evolution, or exploited the connectivity of image regions, respectively. However, their pancreas segmentation performance lags behind that of atlas-based methods.
With the development of deep learning, networks based on FCN [5] and UNet [6] have dominated the field of pancreas segmentation [14-17,33-39]. Compared with traditional approaches, they have a more powerful model capacity to fit data distributions when applied to pancreas segmentation. Roth et al. [15] first proposed a cascaded deep convolutional neural network (DCNN) to segment the pancreas, employing a superpixel method for the coarse-scaled segmentation and leveraging the DCNN for the fine-scaled and refined-scaled segmentation. On this basis, researchers found the application of DCNNs to pancreas segmentation promising and reconsidered whether traditional pancreas segmentation methods could be replaced by DCNNs. Since then, many DCNN-based methods have been proposed, which can be classified into two categories: one-stage segmentation and coarse-to-fine segmentation.
One-stage segmentation. Huang et al. [40] suggested combining UNet [6] and MobileNet-V2 [41] to obtain a lightweight network for pancreas segmentation, and Cai et al. [33] proposed integrating DCNNs and a recurrent network [42] to segment the pancreas. Both of these methods are based on 2D networks. To exploit 3D context, Fang et al. [43] proposed a globally guided progressive fusion network for pancreas segmentation, where the encoder uses 3D convolutions to extract features and the decoder uses 2D convolutions to reduce the memory cost. After that, Oktay et al. [35] and Mo et al. [37] further suggested employing fully 3D networks to segment the pancreas. Although 3D networks are more powerful, their inputs are usually patch-based due to the limits of computer memory, which prevents the network from fusing information from a global view.
Coarse-to-fine segmentation. Compared to one-stage pancreas segmentation, coarse-to-fine approaches assume that the coarse-scaled segmentation mask can be leveraged to focus on the pancreas region. These approaches employ the coarse-scaled segmentation mask to localize the pancreas and then utilize the localized coordinates to crop the coarse-scaled segmentation input for the fine-scaled segmentation. Coarse-to-fine pancreas segmentation can be divided into 2D [14,17,34,38] and 3D approaches [16] according to the dimensionality of the backbones. All of these methods suffer from an inconsistency between the training stage and the test stage: the coarse-scaled and fine-scaled segmentations are separate in training but joint in testing. As a result, researchers proposed cascaded joint training methods [18,19], where the coarse-scaled segmentation mask is directly applied to the fine-scaled input without cropping. We argue that the cropping in separate training and the direct connections in joint training are complementary to some extent; thus we propose the cascaded multi-scale feature calibration UNet (CMFCUNet) for pancreas segmentation, leveraging the advantages of both separate training and joint training.

Methods
In this section, we introduce our cascaded multi-scale feature calibration UNet (CMFCUNet) in detail. As shown in Fig. 4, CMFCUNet aims to segment the pancreas in a cascaded (coarse-to-fine) fashion. The CT scan $X_C$ is first input to the coarse-scaled multi-scale feature calibration UNet (MFCUNet) $f(X_C, \theta_C)$, whose output $Y_C$ is utilized to enhance the input of the fine-scaled segmentation $X_F$ via the dual enhancement module (DEM). After that, MFCUNet $f(X_F, \theta_F)$ is applied again to refine the coarse-scaled segmentation mask in the fine-scaled segmentation. The output of the fine-scaled segmentation is denoted as $Y_F$. We first introduce MFCUNet and then present the pipeline of CMFCUNet.

Fig. 4 Cascaded multi-scale feature calibration UNet

Multi-scale feature calibration UNet
UNet3+ (Fig. 3a) is a fundamental medical image segmentation network consisting of a symmetric encoder and decoder. Both the encoder and the decoder have five scales, and each scale possesses two consecutive 3 × 3 convolution kernels, each followed by batch normalization and a ReLU nonlinearity. In the encoder, feature maps at each scale are down-sampled via 2 × 2 max pooling to extract progressively stronger semantic features. In the decoder, features at each scale are up-sampled via 2 × 2 transposed convolution to gradually recover the resolution, and the final output of the decoder is passed through a sigmoid function to obtain the segmentation mask. Because boundary details are lost during the process of recovering the resolution, multi-scale skip connections directly transmit detailed information from the encoder to the decoder. However, these multi-scale features are inaccurate due to the information lost in down-sampling. In addition, as in UNet3+, directly fusing lower-level features with higher-level features results in feature redundancy. Therefore, we propose the multi-scale feature calibration UNet (MFCUNet) to solve these two issues. As shown in Fig. 3b, MFCUNet differs from UNet3+ in that each scale utilizes a multi-scale feature calibration gate (MFCG) to remove noise and fuse useful information into the higher-level features. MFCG is illustrated in Fig. 5. For ease of understanding, we take the fourth scale as an example to elaborate the implementation of MFCG; the other scales are realized in the same way.
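For concreteness, the scale block described above can be sketched in PyTorch as follows; the module name and channel widths are illustrative rather than taken from the original implementation.

```python
import torch.nn as nn

class ScaleBlock(nn.Module):
    """One encoder/decoder scale: two 3x3 convolutions, each followed by
    batch normalization and ReLU, as described for UNet3+ and MFCUNet."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

# Between scales: 2x2 max pooling in the encoder and 2x2 transposed
# convolution in the decoder (the channel widths here are placeholders).
down = nn.MaxPool2d(kernel_size=2)
up = nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2)
```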
The inputs of the fourth-scale MFCG are the fourth-scale output feature maps and the feature maps of the previous scales, denoted as $x_{en}^4$, $x_{en}^3$, $x_{en}^2$ and $x_{en}^1$. Each $x_{en}^i$ ($i = 1, 2, 3, 4$) is then linearly projected to $q^i$, $k^i$ and $v^i$. Here, $q^i$ represents the $i$th-scale features used to query, $k^i$ represents the $i$th-scale features to be queried (also called keys), and $v^i$ represents the $i$th-scale feature values.
For example, $q^4$ vertically queries the keys of all scales ($k^4$, $k^3$, $k^2$, $k^1$) to obtain the contribution weights of the fourth, third, second, and first scales ($a_{4,4}$, $a_{4,3}$, $a_{4,2}$, $a_{4,1}$) to the fourth-scale feature maps, respectively. After that, the contribution weights are normalized via softmax, yielding $b_{4,4}$, $b_{4,3}$, $b_{4,2}$, $b_{4,1}$, which are then multiplied with $v^1$, $v^2$, $v^3$ and $v^4$. The final output $x_{de}^4$ is the sum over all scales, so the fourth-scale output feature maps $x_{de}^4$ are vertically calibrated in a pixel-by-pixel manner across the multi-scale feature maps. The process is formulated as:

$$a_{4,4}, a_{4,3}, a_{4,2}, a_{4,1} = q^4 k^4,\; q^4 k^3,\; q^4 k^2,\; q^4 k^1 \tag{1}$$

$$b_{4,4}, b_{4,3}, b_{4,2}, b_{4,1} = \mathrm{softmax}\!\left(\frac{\left(a_{4,4}, a_{4,3}, a_{4,2}, a_{4,1}\right)}{\sqrt{d_k}},\; \mathrm{dim}=1\right) \tag{2}$$

$$x_{de}^4 = \sum_{i=1}^{4} b_{4,i}\, v^i \tag{3}$$

where $d_k$ is the dimension of $k^i$, and $\mathrm{dim}=1$ indicates that the softmax is computed along the channel dimension.
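A minimal PyTorch sketch of MFCG for the fourth scale is given below. Two details are assumptions, since the text does not specify them: the lower-scale features are resampled to the fourth-scale resolution with bilinear interpolation so that the dot products can be taken pixel-wise, and the q/k/v projections are realized as 1 × 1 convolutions with a shared width `dim`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MFCG(nn.Module):
    """Sketch of the multi-scale feature calibration gate (4th scale)."""
    def __init__(self, in_chs, dim):
        super().__init__()
        # one query projection for the target (4th) scale,
        # key/value projections for every scale
        self.q_proj = nn.Conv2d(in_chs[-1], dim, 1)
        self.k_projs = nn.ModuleList(nn.Conv2d(c, dim, 1) for c in in_chs)
        self.v_projs = nn.ModuleList(nn.Conv2d(c, dim, 1) for c in in_chs)
        self.dim = dim

    def forward(self, feats):  # feats: [x_en^1, x_en^2, x_en^3, x_en^4]
        h, w = feats[-1].shape[-2:]
        q = self.q_proj(feats[-1])                      # (B, dim, H, W)
        scores, vals = [], []
        for x, k_p, v_p in zip(feats, self.k_projs, self.v_projs):
            # assumption: resample every scale to the target resolution
            x = F.interpolate(x, size=(h, w), mode='bilinear',
                              align_corners=False)
            # pixel-wise dot product between query and key, Eq. (1)
            scores.append((q * k_p(x)).sum(dim=1, keepdim=True))
            vals.append(v_p(x))
        # Eq. (2): softmax over the scale axis, scaled by sqrt(d_k)
        b = torch.softmax(torch.cat(scores, dim=1) / self.dim ** 0.5, dim=1)
        out = 0
        for i, v in enumerate(vals):
            out = out + b[:, i:i + 1] * v               # Eq. (3)
        return out
```

For example, `gate = MFCG(in_chs=[64, 128, 256, 512], dim=512)` would calibrate the fourth-scale features from four encoder scales with those (assumed) channel widths.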

Cascaded multi-scale feature calibration UNet
As mentioned above, there are two main kinds of coarse-to-fine segmentation approaches. One trains the coarse-scaled segmentation and the fine-scaled segmentation separately and crops the coarse-scaled input to form the input of the fine-scaled segmentation network. The other trains the coarse-scaled segmentation and the fine-scaled segmentation jointly: the coarse-scaled segmentation mask is directly applied to the coarse-scaled input, and the result is then fed to the fine-scaled segmentation network. The main difference between the two is that separate training crops the coarse-scaled input according to the localization of the coarse-scaled segmentation mask, so the size of the fine-scaled input varies and the coarse-scaled and fine-scaled segmentations must be trained separately, whereas joint training directly applies the coarse-scaled segmentation mask to the coarse-scaled input, keeping the size of the fine-scaled input unchanged so that the two segmentations can be trained jointly. We argue that separate training and joint training can promote each other to improve pancreas segmentation performance. Therefore, we propose the cascaded multi-scale feature calibration UNet (CMFCUNet) to leverage the advantages of both training strategies. CMFCUNet is shown in Fig. 4. The input of the coarse-scaled segmentation is $X_C$ and its output is denoted as $Y_C$, which helps generate the fine-scaled input $X_F$ via the dual enhancement module (DEM). DEM is shown in the middle of Fig. 4. The output of the coarse-scaled segmentation $Y_C$ is first input to the transform module $t(Y_C, \theta)$ to adapt itself, and the output of the transform module is multiplied by the coarse-scaled input $X_C$ to softly suppress the complex background. $Y_C$ is also used to compute the cropping coordinates. We first compute the center coordinates $(x_{center}, y_{center})$ of the minimum bounding box of $Y_C$. After that, according to the maximum height $h_{max}$ and width $w_{max}$ computed over all volumes, which are described in the next subsection, the coordinates $(x_{crop}, y_{crop}, w_{max}, h_{max})$ for cropping the coarse-scaled input can be calculated as in Eq. (4):

$$x_{crop} = x_{center} - \frac{w_{max}}{2}, \qquad y_{crop} = y_{center} - \frac{h_{max}}{2} \tag{4}$$
where $(x_{crop}, y_{crop})$ is the coordinate of the upper-left corner of the cropped region. Leveraging the coordinates $(x_{crop}, y_{crop}, w_{max}, h_{max})$ to crop the coarse-scaled input yields the fine-scaled input $X_F$. In the DEM pipeline, $t(Y_C, \theta)$ is implemented using two consecutive 3 × 3 convolution kernels with stride 1 and padding 1.
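Under these definitions, the DEM can be sketched as follows. The single-channel convolutions, the 0.5 threshold on the probability map, the batch size of 1, and the clamping of the crop window to the image bounds are assumptions made here to keep the example self-contained.

```python
import torch
import torch.nn as nn

class DEM(nn.Module):
    """Sketch of the dual enhancement module (single-image batches)."""
    def __init__(self, h_max=128, w_max=224):
        super().__init__()
        # transform module t(Y_C, theta): two 3x3 convs, stride 1, padding 1
        self.transform = nn.Sequential(
            nn.Conv2d(1, 1, 3, stride=1, padding=1),
            nn.Conv2d(1, 1, 3, stride=1, padding=1),
        )
        self.h_max, self.w_max = h_max, w_max

    def forward(self, x_c, y_c):
        # soft background suppression: multiply the coarse-scaled input
        # by the transformed coarse probability map
        enhanced = x_c * self.transform(y_c)
        # center of the minimum bounding box of the (non-empty) coarse mask
        ys, xs = torch.nonzero(y_c[0, 0] > 0.5, as_tuple=True)
        x_center = (int(xs.min()) + int(xs.max())) // 2
        y_center = (int(ys.min()) + int(ys.max())) // 2
        # Eq. (4): upper-left corner of the crop window, clamped to the image
        H, W = x_c.shape[-2:]
        x_crop = max(0, min(W - self.w_max, x_center - self.w_max // 2))
        y_crop = max(0, min(H - self.h_max, y_center - self.h_max // 2))
        return enhanced[..., y_crop:y_crop + self.h_max,
                        x_crop:x_crop + self.w_max]
```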

Training and testing
Training. There are two steps in the training process. As the inputs of the fine-scaled segmentation in CMFCUNet should all have the same size, we first trained a plain UNet provided by the study [20] and then utilized the trained UNet to predict the results of all cases. After that, we calculated the bounding box of each case and found the maximum height on the x-axis and the maximum width on the y-axis, denoted as h_max and w_max, respectively. On the NIH dataset, h_max and w_max are set to 128 and 224, respectively.
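This first step can be sketched as below; `max_box_extent` is a hypothetical helper, not a function from the paper, that scans the plain UNet's predictions for the maximal bounding-box extent.

```python
import numpy as np

def max_box_extent(pred_masks):
    """Maximum bounding-box height/width over all predicted cases.

    pred_masks: list of binary numpy arrays (2D slices or 3D volumes)
    predicted by the plain UNet. On the NIH dataset, this procedure
    yielded (h_max, w_max) = (128, 224).
    """
    h_max = w_max = 0
    for mask in pred_masks:
        # project 3D volumes along z so the box covers every slice
        mask2d = mask.max(axis=0) if mask.ndim == 3 else mask
        ys, xs = np.nonzero(mask2d)
        if len(ys) == 0:
            continue  # skip cases with an empty prediction
        h_max = max(h_max, int(ys.max() - ys.min() + 1))
        w_max = max(w_max, int(xs.max() - xs.min() + 1))
    return h_max, w_max
```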
In the second step, we trained CMFCUNet. The coarse-scaled output is obtained through the coarse-scaled MFCUNet, and the fine-scaled input is then produced via the DEM. The cropping coordinates of the coarse-scaled input are calculated from h_max, w_max and the coordinates of each slice. The cropped coarse-scaled input is the fine-scaled input, which is again fed to MFCUNet to obtain the fine-scaled segmentation mask. The overall loss function consists of the coarse-scaled segmentation loss and the fine-scaled segmentation loss, calculated as in Eq. (6):

$$\mathcal{L} = \lambda\, \mathcal{L}_{DSC}(Y_C, Y) + (1 - \lambda)\, \mathcal{L}_{DSC}(Y_F, Y) \tag{6}$$

where the weight $\lambda$ is set to 0.5, $Y$ is the ground truth, $Y_C$ is the coarse-scaled segmentation mask and $Y_F$ is the fine-scaled segmentation mask. $\mathcal{L}_{DSC}$ is the dice loss, formulated as:

$$\mathcal{L}_{DSC}(\hat{Y}, Y) = 1 - \frac{2 \sum_i \hat{y}_i\, y_i}{\sum_i \hat{y}_i + \sum_i y_i} \tag{7}$$

Testing. The testing process differs from the training process. We follow the research [14] and test recurrently. Specifically, the test CT scan is first fed into CMFCUNet to output a fine-scaled segmentation mask, which is again fed into the DEM. This process loops until the DSC of the last two segmentation masks reaches 95%.
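A sketch of this loss computation is shown below; cropping the ground truth to the fine-scaled frame (`y_crop`) is an assumption made here so that the two dice terms are computed over matching regions.

```python
import torch

def dice_loss(pred, target, eps=1e-6):
    # Eq. (7): soft dice loss over probability maps
    inter = (pred * target).sum()
    return 1 - (2 * inter + eps) / (pred.sum() + target.sum() + eps)

def overall_loss(y_c, y_f, y, y_crop, lam=0.5):
    # Eq. (6): weighted sum of the coarse-scaled and fine-scaled dice
    # losses, with the weight set to 0.5 as in the paper
    return lam * dice_loss(y_c, y) + (1 - lam) * dice_loss(y_f, y_crop)
```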

Dataset and settings
We evaluate our method on the NIH pancreas segmentation dataset, which consists of 82 contrast-enhanced CT volumes; the size of each volume is 512 × 512 × L where L ∈ [181, 466]. The voxel spacing along each axis differs greatly across the 82 cases, so we resampled each volume to the computed median spacing, which is 0.85, 0.85, and 1.0 for the x, y and z axes, respectively. We follow the research [44] in applying 3rd-order spline interpolation to the CT volumes and nearest-neighbor interpolation to the corresponding ground truth.
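This resampling step can be sketched with SciPy as follows; the (x, y, z) axis ordering of the arrays is an assumption.

```python
import numpy as np
from scipy.ndimage import zoom

def resample(volume, label, spacing, target=(0.85, 0.85, 1.0)):
    """Resample a CT volume and its label to the median spacing:
    3rd-order spline for the image, nearest-neighbor (order 0)
    for the ground truth. `spacing` is the case's (x, y, z) spacing."""
    factors = np.asarray(spacing, dtype=float) / np.asarray(target, dtype=float)
    img = zoom(volume, factors, order=3)
    lab = zoom(label, factors, order=0)
    return img, lab
```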
We follow the research [15] in using fourfold cross-validation with a random hard split of the 82 cases into training and testing folds, with 21, 21, 20 and 20 cases in each testing fold. The Hounsfield unit (HU) values are clipped to [−100, 240] following the research [45], and z-score normalization is then applied. The input of our method is the slice along the z-axis, whose size is 512 × 512, and the batch size is set to 2. Stochastic gradient descent (SGD) with Nesterov momentum ($\mu = 0.99$) and an initial learning rate of 0.01 is used to learn the network weights. The learning rate is decayed throughout training following the 'poly' learning rate policy $(1 - \mathrm{epoch}/\mathrm{epoch}_{max})^{0.9}$, where $\mathrm{epoch}_{max} = 1000$.
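A PyTorch sketch of these preprocessing and optimization settings follows; whether the z-score statistics are computed per slice or over the whole dataset is not stated, so per-slice statistics are assumed here, and the model is a placeholder.

```python
import torch

def preprocess(slice_hu):
    """Clip HU values to [-100, 240], then z-score normalize (per slice)."""
    clipped = slice_hu.clamp(-100, 240)
    return (clipped - clipped.mean()) / (clipped.std() + 1e-8)

model = torch.nn.Conv2d(1, 1, 3)  # placeholder for CMFCUNet
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.99, nesterov=True)
epoch_max = 1000
# 'poly' policy: multiply the base lr by (1 - epoch/epoch_max)^0.9;
# call scheduler.step() once per epoch
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda epoch: (1 - epoch / epoch_max) ** 0.9)
```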

Evaluation metric
We leverage the Dice similarity coefficient (DSC), Jaccard similarity coefficient (JSC, i.e., the Jaccard index JI), precision, and recall to evaluate pancreas segmentation performance. They are calculated as follows, where TP, FN and FP denote true positives, false negatives and false positives, respectively.

(1) DSC computes the volume overlap between the ground truth and the segmentation mask: $\mathrm{DSC} = \frac{2\,TP}{2\,TP + FP + FN}$.

(2) JSC computes the similarity between the ground truth and the segmentation mask: $\mathrm{JSC} = \frac{TP}{TP + FP + FN}$.

(3) Precision computes the true positive rate in the segmentation mask: $\mathrm{Precision} = \frac{TP}{TP + FP}$.

(4) Recall computes the true positive rate in the ground truth: $\mathrm{Recall} = \frac{TP}{TP + FN}$.
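These four metrics can be computed from binary masks as follows; the helper name is illustrative.

```python
import numpy as np

def segmentation_metrics(pred, gt):
    """DSC, JSC, precision and recall from boolean masks.

    `pred` and `gt` are numpy bool arrays of the same shape; both are
    assumed non-empty so that all denominators are non-zero."""
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    dsc = 2 * tp / (2 * tp + fp + fn)
    jsc = tp / (tp + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return dsc, jsc, precision, recall
```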

Ablation studies
In this section, we conduct ablation studies to analyze the effectiveness of the multi-scale feature calibration gate (MFCG) and the dual enhancement module (DEM), both quantitatively and qualitatively.

Effectiveness of multi-scale feature calibration gate (MFCG)
Multi-scale lower-level features are beneficial for supplementing the location information lost in higher-level features. However, some multi-scale lower-level features are redundant, and inaccurate higher-level features are not calibrated by the multi-scale lower-level features. In this subsection, we explore the effectiveness of MFCG. As can be seen from Table 1, multi-scale features (MF) are helpful and improve the results on all evaluation indexes, especially recall, which increases by 5.46% compared with the baseline. When adding MFCG, DSC and JI increase by over 2% owing to its ability to select the useful multi-scale lower-level features to calibrate the higher-level features. To qualitatively demonstrate the effectiveness of MFCG, we compare the third-scale feature maps of the encoders in UNet, UNet3+, and MFCUNet. As shown in Fig. 6, the green box marks the pancreas region. Our model clearly extracts sharper features of the pancreas region than UNet and UNet3+ because MFCG can automatically choose useful lower-level detailed features in a pixel-wise fashion to repair the higher-level features. Note that in Table 1, MFCG includes MF, and rows 1, 2, and 3 of Table 1 correspond to UNet, UNet3+ (Fig. 3a) and MFCUNet (Fig. 3b), respectively.

Effectiveness of dual enhancement module (DEM)
Here, we verify the effectiveness of the DEM. DEM includes two enhancement operations: multiplication and cropping. Cropping means the cascaded model is trained separately and only position coordinates are utilized to crop the coarse-scaled input, while multiplication means the cascaded model is trained jointly and the coarse-scaled segmentation mask is multiplied with the coarse-scaled input to focus on the relevant pancreas region. Note that, as described in Sect. 3.3, the testing process is recurrent and the fine-scaled segmentation reports the results of the last loop. As shown in Table 2, our method achieves the best performance when leveraging the DEM in both the coarse-scaled and the fine-scaled segmentation. The cropping method trains the coarse-scaled segmentation and the fine-scaled segmentation separately, leaving the fine-scaled segmentation without context from the coarse-scaled segmentation, while multiplication trains the two segmentations jointly so that the fine-scaled segmentation shares the context of the coarse-scaled segmentation, but joint training alone suppresses the background insufficiently. It is also worth noting that the methods based on "Multiply" and "Crop" are essentially two of the mainstream coarse-to-fine segmentation methods; the experimental analysis of two further mainstream coarse-to-fine methods is presented in Sect. 4.4. We also illustrate the qualitative results of four cases in Fig. 7, where the green circles mark the parts of the 3D segmentation masks in which our method leveraging DEM is clearly better than the multiplication and cropping methods.

Comparison with the coarse-to-fine approaches
As mentioned above, our method (CMFCUNet) is a novel coarse-to-fine pancreas segmentation framework combining the advantages of separate training and joint training. In this part, we verify the effectiveness of CMFCUNet against other mainstream coarse-to-fine segmentation methods. As described in Sect. 4.3.2, the fine-scaled segmentation results in Table 2 are the experimental results of the two mainstream coarse-to-fine segmentation methods based on "Crop" and "Multiply". In this section, we present the experimental results of two other mainstream coarse-to-fine segmentation methods based on "Add" and "Concatenate". These four mainstream coarse-to-fine segmentation methods fall into two categories, and we introduce the relevant representative methods here. ResDSNC2F [16] represents the type of coarse-to-fine segmentation based on "Crop", where the coarse-scaled segmentation and the fine-scaled segmentation are trained separately and the fine-scaled input is the cropped coarse-scaled input; during separate training, the feature-map sizes of the coarse-scaled and fine-scaled segmentation differ. X-Net [46], casFCN [47] and Cascaded Dense-unet [48] represent the types based on "Multiply", "Add" and "Concatenate", respectively, where the coarse-scaled segmentation and the fine-scaled segmentation are trained jointly and the fine-scaled input is obtained by directly multiplying, adding or concatenating the coarse-scaled segmentation mask with the coarse-scaled input; during joint training, the feature-map size is constant. Note that, to keep the experimental settings consistent, both the coarse-scaled and the fine-scaled segmentation networks in these mainstream coarse-to-fine methods utilize MFCUNet. As shown in Tables 2 and 3, CMFCUNet outperforms the other cascaded coarse-to-fine segmentation methods because our method first utilizes the coarse-scaled segmentation mask to enhance the coarse-scaled input via multiplication and then crops the enhanced coarse-scaled input to the pancreas region, so the fine-scaled input is enhanced twice in succession, reducing the influence of the complex background. The results of all cases on the four metrics are plotted in Fig. 8; the distribution of our results is clearly better overall than those of the other mainstream coarse-to-fine methods.

Comparison with the state-of-the-art approaches
To verify its effectiveness, we compared our results with baseline and state-of-the-art methods. The quantitative results are shown in Table 4. It can be seen that our method greatly improves pancreas segmentation performance, especially JI, by more than 3%. We also compared the average inference time per test case with other studies, from which we can see that our 2D coarse-to-fine segmentation network greatly reduces inference time.
In addition, as can be seen from the last column of Table 4, our model greatly reduces the number of model parameters compared with most methods. Furthermore, the qualitative comparison of our method with the ground truth is shown in Fig. 9, where the red and blue lines indicate the ground truth and the fine-scaled segmentation masks, respectively. It is easy to see that our fine-scaled segmentation boundaries are very close to the ground-truth boundaries, on both the small and the large pancreas slices. Finally, we qualitatively compare the experimental results of our method with those of the strong baseline methods [17,43]. As shown in Fig. 10, the green circles mark the parts of the 3D segmentation masks where our method is clearly better than the strong baseline methods.

Results on the MSD dataset
To verify the generalization of our method on pancreas segmentation, we conduct comparative experiments on the MSD dataset [51] with strong baseline models. The MSD dataset contains 281 contrast-enhanced CT volumes, and we randomly divided it into four folds for cross-validation.
Other settings are the same as on the NIH dataset.
The experimental results are summarized in Table 5. First, Roth et al. [49] proposed using a 3D FCN to coarsely localize the pancreas and then leveraging the reduced region for fine-scaled segmentation. However, due to separate training, the fine-scaled segmentation lacks the context information of the coarse-scaled segmentation; hence, our method improves the average DSC by 0.7% and decreases the standard deviation by 2.15%. Different from the method of Roth et al. [49], Fang et al. [43] proposed utilizing both 2D and 3D convolutions in the encoder to extract rich spatial information and then leveraging 2D operations in the decoder to predict segmentation masks. However, because the pancreas occupies a small proportion of the input image, this one-stage segmentation network is easily confused by the complex and variable background; as can be seen from Table 5, the method of Fang et al. [43] achieves the lowest pancreas segmentation performance.
Recently, Li et al. [50] found that pancreas segmentation based on encoder-decoder networks easily loses the boundary details of the target due to down-sampling, so they proposed utilizing a 3D UNet to localize the pancreas in the coarse-scaled segmentation and then a high-resolution network to segment the pancreas in the fine-scaled segmentation, avoiding the loss of boundary details. However, due to memory constraints, the input image must be down-sampled by a factor of 4 before being input to the high-resolution network, and the output must be up-sampled by a factor of 4 to restore the image resolution; this up- and down-sampling leads to false pixel predictions in the segmentation masks. Besides, the coarse-to-fine method proposed by Li et al. [50] is trained separately, so the fine-scaled segmentation also lacks contextual information from the coarse-scaled segmentation. Therefore, our method improves pancreas segmentation accuracy by 1.19%, 1.9% and 2.13% on DSC, Max DSC and Min DSC, respectively. In addition, our method has a smaller standard deviation and is more stable.
The qualitative results are shown in Fig. 11, consisting of four samples, each of which marks the boundary of the predicted pancreas and the boundary of the corresponding ground truth. It can be seen that the segmentation results of our method are very close to the ground truth, whether on 2D slices where the pancreas occupies a small proportion or a large one.

Conclusion
In this paper, we proposed a novel cascaded segmentation framework, the cascaded multi-scale feature calibration UNet (CMFCUNet), where a dual enhancement module (DEM) utilizes cropping and joint training to enhance the fine-scaled input. Besides, the backbone of CMFCUNet is the proposed multi-scale feature calibration UNet (MFCUNet), which calibrates higher-level features by vertically leveraging the corresponding pixels of multi-scale lower-level features. Through ablation studies on the NIH pancreas dataset, we first demonstrated that utilizing DEM to connect the coarse-scaled segmentation and the fine-scaled segmentation outperforms utilizing a single connection, and we also demonstrated that MFCUNet is able to calibrate multi-scale higher-level features. Besides, we compared our method with four mainstream coarse-to-fine segmentation approaches, and our framework proved superior. In addition, we compared our method with baseline and state-of-the-art methods, and the results indicate that our method has advantages in DSC, JI, inference time, and model parameters. We also experimentally verified the superiority of our method over the strong baselines on the MSD dataset.