Multi-path Aggregation U-Net for Lung Segmentation in Chest Radiographs

Lung segmentation from chest X-ray images is a fundamental and crucial step for computer-aided diagnosis (CAD) systems. Although many techniques have been proposed for this problem, it remains challenging. Recently, Fully Convolutional Networks (FCNs), especially U-Net, have been hugely successful in many image segmentation tasks. In this paper, we propose a revised variant of U-Net built around two main components. The first is multi-path dilated convolution with different dilation rates to extract multi-scale features, which replaces the basic convolutions used in the original U-Net. The second is skip connections with dense deep layer aggregation, which further aggregate features across different scales. We perform extensive experiments on three publicly available datasets (951 images in total). Our proposed method outperforms many other segmentation methods and achieves state-of-the-art segmentation performance (Dice coefficients of 96.5%, 97.9%, and 96.7% on the three datasets, respectively).


Introduction
As a cost-effective and the most commonly used medical imaging modality, chest radiographs provide extensive and valuable information for the diagnosis of many lung diseases, such as lung cancer, pneumonia, and pneumothorax. Accurate lung segmentation is a critical and fundamental step for Computer-Aided Diagnosis (CAD) of most lung diseases in the automatic analysis of chest radiographs, and its accuracy greatly affects that of downstream tasks. Although many lung segmentation approaches have been presented, the task still poses unsolved problems and challenging scenarios. For example, the shape of the lung area is strongly related to the gender, age, and health condition of the patient. In addition, lung segmentation in chest radiographs becomes much more difficult when external objects such as sternum wires, pacemakers, and surgical clips are present [1].
Over the past decades, a large variety of traditional lung segmentation methods have been proposed. They can be roughly grouped into four types: rule-based, deformable-model-based, pixel-classification-based, and hybrid methods. These methods have achieved reasonable results; however, they have limitations such as heavy reliance on handcrafted features and a tendency toward over- and under-segmentation [19].
Recently, deep learning techniques have undergone a renaissance, driven by the availability of large amounts of training data and high-performance computing [16]. Convolutional Neural Networks (CNNs), one of the key components of deep learning, have come to dominate the computer vision field. For image segmentation, the Fully Convolutional Network (FCN) [18] was one of the first neural networks to adopt a fully convolutional architecture for this problem. Building on FCN, U-Net [22] is a segmentation network specifically designed for medical images. Since its introduction, U-Net has been extensively adopted for medical image segmentation and has achieved superior results [9]. Numerous segmentation works are based on U-Net, with various innovations or improvements made upon it. Li et al. [17] proposed a hybrid U-Net based on the dense connections introduced in DenseNet [11] for tumor segmentation from CT images. Zhou et al. [29] proposed a nested U-Net architecture to better extract semantic features. Ibtehaz et al. [12] designed MultiResUNet, an enhanced version of U-Net, and showed excellent performance. In parallel, many approaches dedicated to lung segmentation have been proposed; for example, Inf-Net [5] adopts a segmentation network based on a parallel partial decoder and a dual attention mechanism for CT images.
Inspired by the above-mentioned works, we propose a novel CNN-based lung segmentation approach for chest X-ray images. Specifically, we design an encoder-decoder segmentation network on the basis of U-Net. To extract multi-scale image features, we design a multi-path convolution block named Multi-Path Dilated Residual Block (MPDRB) to replace the original convolution block in U-Net. In addition, to enlarge the receptive field of the convolution operations in the bottleneck layers, we introduce the Dilated Residual Block (DRB), which utilizes dilated/atrous convolution [27]. Although some previous works have introduced atrous convolution for medical image segmentation (for instance, Hesamian et al. [10] proposed a deep residual CNN with atrous convolution for lung nodule segmentation from CT images), the effect of atrous convolution on lung segmentation from X-ray images has not yet been well studied. Moreover, motivated by Yu et al. [28], we extend the original skip connections in U-Net with Dense Deep Layer Aggregation (DDLA). We evaluate our proposed models on three publicly available chest X-ray datasets and show that our approach achieves state-of-the-art performance compared with other segmentation methods.
The remainder of the paper is organized as follows: Section 2 describes the details of our proposed method, Section 3 presents the experimental results, and Section 4 concludes the paper.

Overview
Our proposed lung segmentation method is an end-to-end deep learning framework. The only pre-processing step is histogram equalization, and no post-processing is needed. During the training phase, the histogram-equalized chest X-ray images and the associated lung segmentation masks are resized and fed into our deep CNN model. In the testing phase, once the model is trained, the lung segmentation map is obtained directly from the model's prediction, with no further processing steps.

Pre-processing
The pre-processing in our approach consists of histogram equalization and image resizing. Since pixel intensity varies greatly across chest X-ray images, histogram equalization is applied to every input radiograph to enhance image contrast.
Histogram equalization is one of the most widely used classical methods for image enhancement. Its main idea is to convert the histogram distribution of an image into an approximately uniform distribution, thereby enhancing image contrast. Specifically, the cumulative distribution function of the input image is first computed from the probability distribution of its pixels. The equalized image is then produced by the transformation

s_k = ((L - 1) / N) * Σ_{j=0}^{k} n_j,    (1)

where k is the input gray level, s_k is the output gray level, and n_j is the number of pixels with gray level j. The total number of gray levels is L and the total number of pixels is N.
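As an illustration, the transformation in Eq. (1) can be implemented in a few lines of NumPy. This is a minimal sketch for 8-bit grayscale images; the function name and interface are ours, not part of the original method:

```python
import numpy as np

def histogram_equalize(img, levels=256):
    """Equalize a grayscale image (uint8 array) following Eq. (1):
    s_k = round((L - 1) / N * sum_{j<=k} n_j)."""
    hist = np.bincount(img.ravel(), minlength=levels)  # n_j for each gray level j
    cdf = np.cumsum(hist)                              # cumulative pixel counts
    lut = np.round((levels - 1) * cdf / img.size).astype(np.uint8)
    return lut[img]                                    # apply the mapping pixel-wise
```

For any non-empty image, the brightest occupied gray level is always mapped to L - 1, which is what stretches the contrast toward the full dynamic range.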

Network Architecture
The overall architecture of our network is illustrated in Figure 1. It is mainly based on encoder-decoder fully convolutional neural networks (e.g., U-Net), with two principal improvements: multi-path dilated convolution and dense deep layer aggregation.
To illustrate the improvements of our proposed network more clearly, Figure 2 shows the architecture of the conventional U-Net. Whereas the conventional U-Net uses two consecutive 3 × 3 convolutions to extract semantic features, we propose multi-path dilated convolutions with different dilation rates to enlarge the receptive field of the convolution operations and thus better learn features at multiple scales. As shown in Figure 1, three building blocks of multi-path dilated convolution are presented: the Multi-Path Dilated Residual Block (MPDRB), the Multi-Path Block (MPB), and the Dilated Residual Block (DRB).
In the original U-Net, skip connections are used to fuse image features between the encoder and decoder at corresponding scales. However, these connections are shallow and linear [28]. Several previous works have attempted to tackle this issue [14]. Building on them, we present a novel feature aggregation paradigm, dense deep layer aggregation, to further strengthen the skip connections.

Multi-path Dilated Convolution
In the original U-Net, every convolution block uses two successive 3 × 3 convolutions, in both the encoder and the decoder. This has some disadvantages: it may not be able to extract spatial multi-scale features, and when the input resolution is large (e.g., 256 or 512), the receptive field of the convolution operations in the bottleneck layers may not cover the entire input image, making the network less effective at segmenting large objects.
Considering the above, we design three components based on dilated convolutions. The first is the MPDRB, in which we design two branches with different dilation rates and one residual connection. The first branch employs two ordinary convolutions to extract small-scale semantic features. The design philosophy behind the VGG network [25] indicates that successive convolutions with small kernels (e.g., 3 × 3) have the same effect as a single convolution with a large kernel (e.g., 5 × 5), but with less computation and fewer parameters. For example, the receptive field of two successive 3 × 3 convolutions is the same as that of one 5 × 5 convolution, but with 28% less computation. Meanwhile, to extract semantic features of large objects, we also design a dilation branch consisting of convolutions with dilation rate 2. Figure 3 illustrates the receptive fields of the two branches used in the MPDRB: Figure 3(a) shows that two successive ordinary 3 × 3 convolutions have a receptive field of 5, and Figure 3(b) shows that two successive dilated convolutions with dilation rate 2 have a receptive field of 9. For the residual connection in the MPDRB, we adopt a 1 × 1 convolution, motivated by the work of He et al. [8].
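A minimal PyTorch sketch of such a block is given below. The exact channel split and fusion scheme are not fully specified above, so we assume each branch produces half of the output channels, the branch outputs are concatenated, and the 1 × 1 residual projection is added to the fused result; all names are illustrative:

```python
import torch
import torch.nn as nn

class MPDRB(nn.Module):
    """Multi-Path Dilated Residual Block (illustrative sketch).

    Branch 1: two ordinary 3x3 convolutions (receptive field 5).
    Branch 2: two 3x3 convolutions with dilation 2 (receptive field 9).
    Residual path: a 1x1 projection, added to the concatenated branches.
    """

    def __init__(self, in_ch, out_ch):
        super().__init__()
        half = out_ch // 2  # assumed: each branch carries half the channels
        self.plain = nn.Sequential(
            nn.Conv2d(in_ch, half, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(half, half, 3, padding=1), nn.ReLU(inplace=True))
        self.dilated = nn.Sequential(
            nn.Conv2d(in_ch, half, 3, padding=2, dilation=2), nn.ReLU(inplace=True),
            nn.Conv2d(half, half, 3, padding=2, dilation=2), nn.ReLU(inplace=True))
        self.residual = nn.Conv2d(in_ch, out_ch, 1)  # 1x1 shortcut projection

    def forward(self, x):
        fused = torch.cat([self.plain(x), self.dilated(x)], dim=1)
        return torch.relu(fused + self.residual(x))
```

Note that padding is chosen as `dilation * (kernel_size - 1) / 2` so that both branches preserve the spatial resolution and can be concatenated.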
The second component is the Multi-Path Block (MPB), which is the MPDRB without the residual connection. The last is the DRB, in which we stack four dilated convolutions with dilation rates 1, 2, 4, and 8, respectively. The receptive field of one convolution stack in the DRB is 31, as illustrated in Figure 4. Even when the input resolution is 512, this receptive field approximately covers the entire input image, since the resolution of the feature maps in the bottleneck layer is 32.
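The receptive fields quoted above can be checked with a short calculation: for a stack of stride-1 convolutions, each layer with kernel size k and dilation d enlarges the receptive field by (k - 1) * d. The following sketch reproduces the values 5, 9, and 31:

```python
def receptive_field(convs):
    """Receptive field of a stack of stride-1 convolutions, where each
    element of `convs` is a (kernel_size, dilation) pair."""
    rf = 1
    for kernel, dilation in convs:
        rf += (kernel - 1) * dilation  # each layer widens the field by (k-1)*d
    return rf

# The two MPDRB branches (Figure 3):
plain_branch = receptive_field([(3, 1), (3, 1)])    # two ordinary 3x3 convs -> 5
dilated_branch = receptive_field([(3, 2), (3, 2)])  # two dilation-2 convs -> 9
# The DRB stack (Figure 4): dilation rates 1, 2, 4, 8 -> 31
drb = receptive_field([(3, 1), (3, 2), (3, 4), (3, 8)])
```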

Dense Deep Layer Aggregation
The main purpose of the skip connections in the conventional U-Net is to merge features at corresponding scales between the encoder and decoder. They also provide shortcut paths for back-propagating gradients. Nevertheless, a skip connection is a plain copy operation, so it remains linear and shallow [28]. In [14], the skip connections are extended with DLA; however, that design eliminates the shortcut paths for gradient propagation. We therefore design DDLA, which combines DLA with dense connections: direct connections are introduced from every layer in DLA to all of its following layers.
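As a schematic illustration only (the full DDLA wiring follows the DLA topology of [28] plus the extra dense links described above, which we do not reproduce here), an aggregation node that fuses an arbitrary number of same-resolution feature maps might look as follows in PyTorch:

```python
import torch
import torch.nn as nn

class AggNode(nn.Module):
    """One aggregation node: fuses several same-resolution feature maps
    by channel-wise concatenation followed by a 3x3 convolution."""

    def __init__(self, in_chs, out_ch):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(sum(in_chs), out_ch, 3, padding=1),
            nn.ReLU(inplace=True))

    def forward(self, feats):
        # Under dense aggregation, `feats` grows as each node also receives
        # the outputs of all earlier nodes at the same scale.
        return self.fuse(torch.cat(feats, dim=1))
```

In DDLA, the output of every such node would additionally be forwarded to all subsequent nodes, which is what restores the shortcut paths that plain DLA removes.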

MC Dataset
The MC dataset consists of 138 frontal chest radiographs, each with a resolution of 4892 × 4020 or 4020 × 4892. The set was collected over many years by the Department of Health and Human Services, Montgomery County, Maryland, USA. The corresponding gold-standard lung segmentation masks were annotated under the supervision of experienced radiologists.

JSRT Dataset
The JSRT dataset contains 247 chest X-rays gathered from 14 medical centers and assembled by the Japanese Society of Radiological Technology (JSRT). All images have a resolution of 2048 × 2048. The lung segmentation masks were introduced by Ginneken et al. [6].

Shenzhen Dataset
The Shenzhen dataset was collected from Shenzhen No.3 People's Hospital, Shenzhen, China, and contains 662 images in total. The dimensions of the images vary but are approximately 3000 × 3000. The corresponding lung segmentation masks were provided by Stirenko et al. [26]. However, only a portion of the images (566) have associated lung segmentation masks, so only those 566 images are used in our experiments. Table 1 summarizes the details of each dataset used in this study, and some visual examples from the three datasets are shown in Figure 5. The example images in Figure 5 were obtained through a weighted summation of the green channel of the chest image and the associated lung segmentation mask.

Evaluation Metrics
To evaluate the performance of our proposed approach and compare it with other lung segmentation methods, we use two evaluation metrics in this study: the Jaccard similarity coefficient (JSC, also referred to as the overlap measure) and the Dice coefficient (DCS). Given a predicted segmentation P and the ground truth G, they are defined as

JSC = |P ∩ G| / |P ∪ G|,    (2)

DCS = 2|P ∩ G| / (|P| + |G|).    (3)


Implementation Details
The data splitting protocol for all three datasets is the same: 50% of the data is used for training, 10% for validation, and 40% for testing. To obtain reliable and stable performance estimates and eliminate random-splitting bias, we randomly generate 10 different training/validation/testing splits. Following previous studies, all input images are resized to 256 × 256. Since pixel intensity varies greatly across some images, histogram equalization is applied to every input radiograph as the pre-processing step.
Since we make two main improvements over previous works, we implement three segmentation models: MPDC U-Net, which denotes U-Net extended with multi-path dilated convolution; DDLA U-Net, which denotes U-Net extended with dense deep layer aggregation; and MPDC DDLA U-Net, which combines both. Moreover, we also implement the DLA U-Net model, i.e., U-Net extended with DLA as described in [14].
All models were implemented in the PyTorch framework [20] and optimized with the Adam optimizer using a learning rate of 1e-3, as suggested in the original literature [15]. The weights of all models were initialized with the MSRA initialization strategy [7] and trained from scratch. The batch size for both training and validation is 4. To prevent overfitting and accelerate convergence during training, we use several strategies, including Dropout with a drop rate of 0.2 and early stopping with a patience of 15 epochs. The model that achieves the lowest validation loss is used to evaluate performance on the test set.
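For concreteness, the early-stopping criterion described above (stop when the validation loss has not improved for 15 consecutive epochs) can be sketched as follows; the class and method names are ours:

```python
class EarlyStopping:
    """Stop training once the validation loss has not improved
    for `patience` consecutive epochs."""

    def __init__(self, patience=15):
        self.patience = patience
        self.best = float("inf")   # best validation loss seen so far
        self.bad_epochs = 0        # epochs since the last improvement

    def step(self, val_loss):
        """Record one epoch's validation loss; return True to stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

The training loop would call `step` once per epoch after validation and break out when it returns True, keeping a checkpoint of the best-loss model for final evaluation.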

Study on Loss Functions
The design and selection of the loss function is critical to the performance of deep learning segmentation models. The loss function measures the error between the prediction and the ground truth, which is back-propagated through the network layers to update the weights. For segmentation problems, Cross Entropy (CE) loss and Dice loss are the two most widely used options, and we investigate both in our model. CE loss is perhaps the most widely used loss function for image segmentation; it is a pixel-wise loss that examines each pixel separately and compares the predicted class vector with the ground-truth class vector. In the binary segmentation case, CE is defined as

L_CE = -(1/N) Σ_i [y_i log(ŷ_i) + (1 - y_i) log(1 - ŷ_i)],    (4)

where y_i and ŷ_i are the ground-truth and predicted segmentation, respectively, i indexes the pixels of the image, and N is the number of pixels.
Dice loss is another popular loss function for image segmentation, based on the overlap between the ground truth and the prediction. The Dice coefficient is often used to evaluate segmentation performance and ranges from 0 to 1. The Dice loss is defined as

L_Dice = 1 - 2 Σ_i y_i ŷ_i / (Σ_i y_i + Σ_i ŷ_i).    (5)

To find the most suitable loss function for the lung segmentation task, we first run experiments on the Shenzhen dataset using our proposed model with three loss functions: CE loss, Dice loss, and their sum, training one model with each. The accuracy curves on the validation set are shown in Figure 6, where we can clearly observe that the combination of CE and Dice loss achieves the highest accuracy. Therefore, for the remaining experiments we use the sum of the two losses for all our proposed models:

L = L_CE + L_Dice.    (6)

Figure 6: Accuracy curves on validation set for different loss functions.
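A possible PyTorch implementation of the combined loss in Eq. (6) is sketched below, assuming the network outputs raw logits; the small constant `eps` is our addition to avoid division by zero for empty masks:

```python
import torch
import torch.nn.functional as F

def bce_dice_loss(pred, target, eps=1e-6):
    """Combined loss L = L_CE + L_Dice (Eq. 6).

    `pred`: raw logits of shape (B, 1, H, W); `target`: binary mask of
    the same shape. `eps` guards against division by zero.
    """
    ce = F.binary_cross_entropy_with_logits(pred, target)  # cross-entropy term
    prob = torch.sigmoid(pred)
    inter = (prob * target).sum()
    dice = 1 - (2 * inter + eps) / (prob.sum() + target.sum() + eps)  # Dice term
    return ce + dice
```

A confident, correct prediction drives both terms toward zero, while a confident wrong one is penalized by both, which is the behavior that makes the sum a convenient training objective.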

Results and Discussion
The performance comparison of our proposed approach with other methods on the MC, JSRT, and Shenzhen datasets is presented in Table 2. We compare extensively with many renowned segmentation networks such as FCN-8 [18], U-Net [22], and DeepLabv3 [4]; the results of these three models come from our implementation. We also compare with some recently proposed approaches for lung segmentation, namely Candemir et al. [3], Rashid et al. [21], and Novikov et al. [19], whose results are taken directly from the original literature, as well as with medical image segmentation methods such as MultiResUNet [12] and AG-net [23], whose results are again from our implementation. A visual comparison of lung segmentation results is shown in Figure 7. From Table 2 we can observe that our best-performing model (MPDC DDLA U-Net) reaches 94.8%, 95.6%, and 92.9% in terms of JSC and 96.5%, 97.9%, and 96.7% in terms of DCS on the three datasets, respectively. Compared with our baseline model (U-Net), our best model improves JSC by 1.7%, 1.1%, and 0.5%, and DCS by 1.3%, 0.5%, and 0.9% on the MC, JSRT, and Shenzhen datasets. Notably, our best model achieves the highest score in terms of both metrics on all three datasets, surpassing numerous other methods, including renowned segmentation models and recently proposed lung segmentation approaches.

Figure 7: A visual comparison of lung segmentation results between U-Net and our proposed MPDC DDLA U-Net model.

Further, even our models with just one improved component (MPDC U-Net or DDLA U-Net) perform better than the majority of the other methods. Specifically, our model with multi-path dilated convolution (MPDC U-Net) performs significantly better than the original U-Net, which demonstrates that multi-path dilated convolution improves lung segmentation performance over the original U-Net. In addition, our DDLA U-Net model consistently performs better than U-Net extended with DLA, which shows the benefit of adding dense connections to DLA.
In summary, these results indicate that incorporating multi-path dilated convolution and dense deep layer aggregation into the original U-Net significantly improves performance for lung segmentation in chest radiographs.

Conclusion
In this paper, we propose a multi-path aggregation U-Net for lung segmentation in chest X-ray images. Specifically, we propose two improvements upon U-Net. The first is multi-path dilated convolution with different dilation rates, which enables the model to extract multi-scale semantic features and allows the receptive field to cover the whole spatial extent of the input image. The second is dense deep layer aggregation, which further strengthens the original skip connections in U-Net. Extensive experiments on three publicly available datasets indicate that our proposed method achieves the best performance compared with other standard segmentation models and chest X-ray lung segmentation methods. One limitation of our approach is that it performs only one task, i.e., lung segmentation. In the future, we plan to integrate more downstream tasks, such as pneumonia classification, to provide a more comprehensive pipeline for CAD of chest radiographs.

Ethics approval and consent to participate
Our manuscript does not involve research on human participants, human data, or human tissue, so no statement of ethical approval or consent is required.

Consent for publication
Our manuscript does not contain any individual person's data in any form, so consent for publication is not required.

Data Availability
The data included in this paper are available without any restriction.

Conflicts of Interest
The authors declare that they have no conflicts of interest to report regarding the present study.

Funding Statement
This work was supported by the Research Foundation for Outstanding Young of Education Bureau of Hunan Province (No.18B571).