A Unet-Based Study of a Multi-Output Convolutional Neural Network's Ability to Decrease Mis-identification: Automatic Segmentation of Organs at Risk in the Thorax



Abstract

Background
To study a multi-output convolutional neural network (CNN)'s capability of reducing mis-identification.

Material and Methods
To guarantee that the number of CNN outputs was the only experimental variable, we used Unet as the research object. By modifying it into a multi-output (MO) network, we obtained an MO-Unet, with the conventional single-output Unet (SO-Unet) as the comparison object. All images involved in this study were computed tomography (CT) scans from 105 patients with thoracic tumors. Three organs at risk (OARs), i.e. lung, heart and spinal cord, were delineated by experienced radiation oncologists and used as ground truth. The two models were both trained with 1240 CT images (856 images for learning and 384 images for monitoring) under the same learning settings. They were both tested on the other 886 images. Dice and the number of mis-identified pixels (n) were the two metrics for evaluation.

Results
MO-Unet and SO-Unet achieved Dice of 0.9400 ± 0.0612 (average ± standard deviation) and 0.9451 ± 0.0618 for lung, 0.9143 ± 0.1119 and 0.9160 ± 0.1071 for heart, and 0.8988 ± 0.0657 and 0.9020 ± 0.0624 for spinal cord, respectively. The differences between the two models' average Dice values were all ≤ 0.005. For the normalized number of cases with n = 0, MO-Unet and SO-Unet had 97.29% and 96.84% for spinal cord, 88.49% and 90.86% for heart, and 81.26% and 77.09% for lung, respectively. Compared to SO-Unet, the mis-identification cases of MO-Unet mainly fell in the range of small n.

Conclusions
The Dice results showed that the two models had comparable overlap. The n results suggested that MO-Unet was better at decreasing mis-identification. Besides, an MO network is lightweight, allowing more delineation work under the same computing resources. Therefore, an MO network is promising for segmenting OARs and has the potential for widespread application in China.
Introduction

In recent years, with the development of convolutional neural networks (CNNs) in the field of image processing, more and more networks [11][12][13][14][15][16][17] have appeared to segment OARs in CT images automatically, with good results. Feng X et al. [12] used three-dimensional (3D) Unets to locate thoracic OARs and then segment them. They reported mean Dice of 0.89, 0.97 and 0.93 and average 95% Hausdorff distances of 1.89 mm, 4 mm and 2.10 mm for spinal cord, lung and heart, respectively. Tao He et al. [11] proposed a U-like network trained under a multi-task learning scheme. The major task was segmentation; the auxiliary task was global slice classification, under the hypothesis that OARs appear in similar slice orders for most patients. This U-like network reached a heart Dice of 0.95. Besides, a 5-channel CNN fed multiple images highlighting different tissues achieved average heart, spinal cord and lung Dice of 0.91, 0.76 and 0.95 [16]. A Unet-GAN [17] attained mean Dice of 0.85, 0.96 ~ 0.97 and 0.88 for heart, lung and spinal cord.
Among the above networks, Unet [18] is a classic one with good image segmentation performance. When we trained a two-dimensional (2D) Unet to delineate thoracic OARs, a few exceptions with mis-identification appeared, as shown in Fig. 1. In this figure, some pixels are wrongly categorized into OARs. This phenomenon may be caused by the CNN principle: a CNN conducts each pixel's classification as a separate task, based only on the gray distribution of a small image patch (i.e. the receptive field). No prior knowledge, such as shape or the identity of a neighboring organ, is involved in the classification [19]. Therefore, a CNN may give a wrong classification when different organs in a receptive field show similar gray values. As shown in Fig. 2, the lung pixels in the heart pixel's receptive field (r_heart) and the external-air pixels in the arm pixel's receptive field (r_arm) both exhibit 0 grayscale.
Besides, the heart pixels and the arm pixels show similar image intensities. Without knowing that the 0 grayscales in r_heart and r_arm belong to lung and external air respectively, the CNN is highly likely to wrongly classify pixel B into the heart.
To decrease the above mis-identification, a possible solution is to label different organs with different numbers. Given that the common activation function, the rectified linear unit (ReLU) [20], has no upper limit on its output and hence may map an image intensity to any positive number, we cannot use a preassigned number to represent an organ. Therefore, we built a multi-output (MO) network whose different outputs correspond to different organs. In this way, the network could learn an optimal number to represent a certain organ. The MO design is also known as multi-task [21,22] (MT) or multi-label [12,23] (ML) learning. Such designs have been reported in some papers [12,21,23], but most of them focused on the overlap between model results and ground truth, and few investigated their performance in reducing mis-identification.
To verify the above hypothesis, we modified a classic 2D Unet into a multi-output one (abbreviated as MO-Unet) and trained it to segment three thoracic OARs (i.e. lung, heart and spinal cord).
Then we compared its performance with that of a single-output 2D Unet (abbreviated as SO-Unet). The rest of this paper is organized as follows. The results are shown and discussed in Sect. 2 and 3, respectively. Our conclusion is presented in Sect. 4. Section 5 gives the detailed architecture of MO-Unet and introduces our experiments.

It has attracted more and more interest to reduce duplicate clinical work by using CNNs to segment OARs automatically. Along with this convenience, safety should also gain attention. Given that OAR delineation is used to optimize and evaluate a radiation treatment plan quantitatively, mis-identification is a non-negligible factor, especially for a serial organ such as the spinal cord: when any subunit of a serial organ is irradiated with a dose above tolerance, the entire organ fails. To the best of our knowledge, few papers have studied a CNN's mis-identification. Given that segmentation by a CNN is actually a pixel-wise classification, deviation-related metrics [24], such as the 95% Hausdorff distance [15,25], the average Hausdorff distance [26] and the mean surface distance [15][27][28][29], cannot comprehensively evaluate mis-identification. Counting how many wrong classifications a network makes, i.e. n in this paper, can help us learn more about its performance.

Results
In this work, we give a possible explanation for the wrong identifications made by CNNs and propose a multi-output architecture as a potential solution. For validation, we adopted a classic segmentation network, namely Unet, as our research object.
The Dice statistics are summarized in Table 1. Additionally, an MO network can segment several OARs simultaneously. Therefore, compared to an SO network, it requires fewer computing resources when performing the same amount of delineation work. Consequently, an MO network seems to be a better choice for widespread application in China, since most hospitals in China cannot afford a high-configuration computing server.

Future work
This work focused on the potential of reducing mis-identification by using an MO architecture. Besides this method, a three-dimensional (3D) network may also contribute, as it provides 3D features.
In future work, we will further investigate a 3D network's capability of decreasing mis-identification and compare it with an MO network.

Conclusion
In this study, we proposed a multi-output architecture as a potential solution to decrease mis-identification. We modified a classic CNN for image segmentation, i.e. Unet, into a multi-output one (MO-Unet) and compared its performance with the conventional Unet (SO-Unet) under the same learning settings. Two metrics were adopted in our work: Dice and the number of mis-identified pixels. The results showed that MO-Unet achieved Dice statistics similar to SO-Unet and performed better in decreasing mis-identification. Besides, compared to SO-Unet, MO-Unet is a lightweight network for implementing the same segmentation workload. In conclusion, a multi-output network has the potential to segment OARs with high accuracy and low mis-identification. It is also a promising way toward broad application.

Methods

5.1 Networks
The SO-Unet is an open-source network [30]. Its detailed architecture is shown in Fig. 4(a).
Compared to the original Unet proposed by Olaf Ronneberger et al. [18], the filter number in each layer was halved, constrained by the computational capacity of our hardware. In the expansive path, transposed convolution [31] was adopted instead of upsampling the feature map followed by a 2 × 2 convolution. All convolutions in this network were padded convolutions, guaranteeing the same size for input and output. We did not adopt the overlap-tile strategy reported by Olaf Ronneberger et al. [18], because our input images differed from theirs. In their work, the inputs were cell images whose border pixels had non-zero gray values; padded convolutions add zeros to the borders of such images, which may change the grayscale distribution within the border pixels' receptive fields and hence decrease segmentation accuracy along the border. In contrast, our input images were thoracic CTs whose border pixels usually equaled zero (i.e. external air), so the zero-padding operation does not change the grayscale distribution in the receptive fields of the image border.
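This border argument can be sketched numerically. The following is a minimal, hypothetical 1-D analogue (not the authors' code): a "same"-mode convolution implicitly zero-pads the border, which only distorts border statistics when the border values are non-zero.

```python
import numpy as np

# Hypothetical 1-D analogue of a padded ("same") convolution.
kernel = np.array([1.0, 1.0, 1.0])                    # toy 3-tap filter

ct_row = np.array([0.0, 0.0, 5.0, 7.0, 0.0, 0.0])    # CT-like row: zero border (external air)
cell_row = np.array([3.0, 4.0, 5.0, 7.0, 6.0, 2.0])  # cell-like row: non-zero border

# mode="same" implicitly zero-pads both ends before convolving.
ct_same = np.convolve(ct_row, kernel, mode="same")
cell_same = np.convolve(cell_row, kernel, mode="same")

# For the CT-like row, the implicit zeros match the real surrounding air, so the
# border response (0) is consistent with the scene. For the cell-like row, the
# padding injects zeros that never existed, distorting the border statistics.
```

Here `ct_same[0]` stays 0, consistent with an all-air neighborhood, while `cell_same[0]` is computed against a fabricated zero neighbor.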
The last convolution used a sigmoid activation function to achieve binary classification. All other convolutions used ReLU activation.
Following the possible explanation of mis-identification stated in the Introduction, we added another two output branches to the last hidden layer of SO-Unet to constitute MO-Unet (as exhibited in Fig. 3(b)). That is, one MO-Unet produced three OARs' segmentations simultaneously, whereas three SO-Unets output the three OARs' delineations respectively. In this way, the MO architecture is the only variable in our investigation, and the experimental results can help us assess whether the MO architecture reduces mis-identification. In the meantime, the numbers representing the different organs were left learnable.
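The branching idea can be illustrated with a small numpy sketch (a hedged illustration, not the authors' implementation; the sizes, names and random weights are all hypothetical): each extra output branch is a 1 × 1 convolution plus sigmoid attached to the shared last hidden layer.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def head(features, w, b):
    """A 1x1 convolution: a per-pixel linear map over the channel axis."""
    return features @ w + b          # (H, W, C) @ (C,) -> (H, W)

rng = np.random.default_rng(0)
H, W, C = 4, 4, 8
last_hidden = rng.standard_normal((H, W, C))   # shared last hidden layer

# One independent 1x1-conv + sigmoid branch per organ.
organs = ["lung", "heart", "spinal_cord"]
branches = {o: (rng.standard_normal(C), 0.0) for o in organs}
outputs = {o: sigmoid(head(last_hidden, w, b)) for o, (w, b) in branches.items()}
# Each branch yields its own binary probability map over the same image.
```

Because the branch weights are learned per organ, the network is free to pick whatever internal representation best separates each organ, instead of being forced onto preassigned label values.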

5.2 Experiments

5.2.1 Data acquisition and preprocessing
In total, 105 patients with tumors in the thoracic region were enrolled in this experiment. All of them received CT scans on a Light Speed (GE Healthcare, Chicago, USA) or a Brilliance CT Big Bore system (Philips Healthcare, Best, the Netherlands). All CT images were reconstructed with a slice thickness of 5 mm and a matrix size of 512 × 512, with an in-plane resolution of approximately 1 mm. Among the acquired CT images, only the 2126 images that encompassed lung, heart and spinal cord were involved. All OARs were delineated by experienced radiation oncologists and regarded as ground truth.
In each CT image, we linearly converted the stored pixel values in the range of −135 ~ 215 HU into image intensities of 0 ~ 255. To save computing resources, all images were cropped to 512 × 256 to remove unnecessary external air. This was achieved by using threshold segmentation (threshold value = 0) and a border-following algorithm [32] to detect the foreground edge, and then cropping the image to 512 × 256 based on the foreground center.
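The preprocessing can be sketched in numpy as follows (function names are illustrative; for brevity, the foreground is located here with a simple column scan rather than the cited border-following algorithm):

```python
import numpy as np

def window_to_uint8(hu, lo=-135.0, hi=215.0):
    """Linearly map the HU window [lo, hi] to image intensities [0, 255]."""
    clipped = np.clip(hu, lo, hi)
    return ((clipped - lo) / (hi - lo) * 255.0).astype(np.uint8)

def crop_around_foreground(img, out_w=256):
    """Crop the width to out_w columns centred on the foreground (intensity > 0)."""
    cols = np.any(img > 0, axis=0)
    if not cols.any():
        center = img.shape[1] // 2           # no foreground: keep the middle
    else:
        idx = np.flatnonzero(cols)
        center = (idx[0] + idx[-1]) // 2
    left = int(np.clip(center - out_w // 2, 0, img.shape[1] - out_w))
    return img[:, left:left + out_w]

hu = np.full((512, 512), -1000.0)            # air everywhere
hu[200:300, 240:280] = 40.0                  # hypothetical soft-tissue block
img = window_to_uint8(hu)                    # air maps to 0 after windowing
crop = crop_around_foreground(img)           # 512 x 256 around the foreground
```

Note that after windowing, external air (≤ −135 HU) maps exactly to intensity 0, which is what makes the threshold value of 0 usable for foreground detection.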

Network training
Among all images, 856 were in the training set and 384 in the validation set. The remaining 886 images were used for testing. During training, the images in the training set were used to optimize the network, and those in the validation set were used to monitor the whole training process. The test set was used for evaluation and comparison.
In our experiment, MO-Unet and SO-Unet were both trained using adaptive moment estimation [33] (Adam) with the same parameters. The learning rate was 10^-3 and the batch size was 16.

Evaluation
To compare the performances of MO-Unet and SO-Unet, two metrics were adopted: Dice and the mis-identified pixel number (n). Both were measured in 2D.
Dice was a metric measuring the spatial overlap between two sets of binary segmentations, as defined in Eq. (1):

Dice = 2 |X ∩ Y| / (|X| + |Y|)   (1)

where X and Y are the two pixel sets being compared (the model segmentation and the ground truth).
n was the number of mis-identified pixels. A pixel x in X was categorized as mis-identified when its minimum distance d(x) to the pixels in Y was greater than a threshold distance:

d(x) = min_{y ∈ Y} ||x − y||,   n = #{x ∈ X : d(x) > T}   (2)

in which || || denotes the Euclidean distance, # denotes the number of elements, and T (unit: pixel) is the threshold distance. In our work, T = 20 for lung, T = 15 for heart and T = 5 for spinal cord. X and Y have the same denotations as in Eq. (1).
T was set to avoid the influence of non-pixel-level human delineation and inter-observer variance.
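The two metrics can be sketched in numpy as follows (an illustrative implementation of the definitions above, not the authors' code; the brute-force pairwise distance computation assumes the masks are small):

```python
import numpy as np

def dice(x, y):
    """Dice overlap between two binary masks."""
    x, y = x.astype(bool), y.astype(bool)
    denom = x.sum() + y.sum()
    return 2.0 * np.logical_and(x, y).sum() / denom if denom else 1.0

def misidentified_count(x, y, t):
    """Number of pixels in mask x farther than t pixels from every pixel in mask y."""
    px = np.argwhere(x)                      # predicted pixel coordinates
    py = np.argwhere(y)                      # ground-truth pixel coordinates
    if len(px) == 0:
        return 0
    if len(py) == 0:
        return len(px)
    # pairwise Euclidean distances, then the minimum over ground-truth pixels
    d = np.sqrt(((px[:, None, :] - py[None, :, :]) ** 2).sum(-1))
    return int((d.min(axis=1) > t).sum())

gt = np.zeros((32, 32), bool)
gt[10:20, 10:20] = True                      # ground-truth square
pred = gt.copy()
pred[0, 0] = True                            # one stray pixel far from the organ
```

With the spinal-cord threshold T = 5, the stray pixel at (0, 0) lies about 14 pixels from the nearest ground-truth pixel and is counted as mis-identified, while Dice is barely affected; this is exactly why n complements Dice.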

Consent for publication
Not applicable

Availability of data and materials
The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.

Competing interests
The authors declare that they have no competing interests.

Figure 1
Illustration of mis-identification detected by Unet. Yellow arrows denote the mis-identification.

Figure 2
Illustration of the convolutional neural network (CNN) principle and a possible reason for mis-identification. r_heart and r_arm denote two different organs' receptive fields.