Generative Skull Stripping of Multiparametric Brain MRIs Using a 3D Convolutional Neural Network

Accurate skull stripping facilitates subsequent neuroimage analysis. For computer-aided methods, the presence of the skull in structural MRI impacts brain tissue identification and can result in serious misjudgment, especially for patients with brain tumors. Though there are some existing works on skull stripping in the literature, most of them either focus on healthy brain MRIs or apply only to a single image modality, and may not be optimal for multiparametric MRI scans. In this paper, we propose an ensemble neural network (EnNet), a 3D convolutional neural network (3DCNN)-based method, for brain extraction from multiparametric brain MRI scans. We comprehensively investigate skull stripping performance using the proposed method on a total of 15 image modality combinations; the comparison shows that using all modalities provides the best performance. We have collected a retrospective dataset of 815 cases with glioblastoma from the University of Pittsburgh Medical Center (UPMC) and The Cancer Imaging Archive (TCIA). The skull stripping ground truths are verified by at least one qualified radiologist. The quantitative evaluation reports the average Dice similarity coefficient and the Hausdorff distance at the 95th percentile. We also compare the performance against state-of-the-art methods/tools; the proposed method offers the best performance. The contributions of this work are fivefold: first, the proposed method is a fully automatic end-to-end skull stripping approach using a 3D deep learning method. Second, it is applicable to multiparametric MRIs (mpMRIs) and is also easy to customize for a single MRI modality. Third, the proposed method works not only for healthy brain mpMRIs but also for pre-/post-operative brain mpMRIs with GBM. Fourth, the proposed method is capable of handling multicenter data.
Last, to the best of our knowledge, we are the first group to quantitatively compare skull stripping performance using different modalities.

Keywords—Skull stripping, brain extraction, glioblastoma, 3D convolutional neural network, multiparametric MRIs


I. Introduction
In the U.S., about 23 per 100,000 people were diagnosed with brain tumors during 2011-2015 [1]. Gliomas, originating from glial cells, are the most common primary brain malignancies, with varying degrees of aggressiveness [2]. For proper treatment planning, accurate brain tumor detection and segmentation are in strong demand. Because it is time-consuming, prone to inter-rater error, and of low efficacy, manual brain tumor segmentation by radiologists is very challenging and not feasible for large-scale data [3]. Therefore, automatic computer-aided brain tumor segmentation/detection is highly desired [3][4][5][6][7][8][9]. However, high-resolution brain magnetic resonance images (MRIs) contain non-brain tissues, such as the eyeballs, skin, neck, and muscle [10]. The presence of these non-brain tissues is one of the major challenges for automatic brain image analysis, and their removal is a typical preprocessing step for most brain MRI studies, e.g., brain volumetric measurement [11], brain tissue segmentation [12], assessing schizophrenia [13], and Alzheimer's disease [14]. Consequently, before applying an automatic computational technique to brain MRI studies, skull stripping is a prerequisite for brain imaging analysis [15].
As a preprocessing step, skull stripping, aka brain extraction, removes the skull and other non-brain tissues from MRI scans. It reduces human rater variance and eliminates time-consuming manual processing steps that potentially impede not only the analysis but also the reproducibility of large-scale studies [16]. The quality of skull stripping can be affected by several factors, including imaging artifacts, the MRI scanner, and the acquisition protocol. Furthermore, variability in anatomy, age, and the extent of brain atrophy affects skull stripping as well [17]. The problem becomes more complicated for MRI scans with pathological conditions, such as brain tumors, which change the intensity appearance in MRI. The situation can become worse when dealing with post-treatment MRIs of brain tumors, especially after resection surgery: the cavities resulting from resection not only change the intensity appearance but also alter the brain anatomy. All these factors undermine skull stripping performance.
We argue that good skull stripping leads to good follow-up brain analysis. Therefore, in this paper, we propose a 3D deep neural network-based method for skull stripping. The contributions of this work include: first, it is a fully automatic end-to-end technique for skull stripping using a 3D deep learning method; second, it is applicable to multiparametric MRIs (mpMRIs) and is also easy to customize for a single MRI modality; third, it works not only for healthy brain MRIs but also for pre-/post-operative brain MRIs with a brain tumor; fourth, the proposed method applies to multicenter data; fifth, to the best of our knowledge, we are the first group to quantitatively compare skull stripping performance using different modalities.

II. Previous Work
There are many skull stripping methods proposed in the literature. These methods can be broadly classified into four categories: morphology-based, intensity-based, deformable surface-based, and atlas-based [10]. Morphology-based methods utilize morphological erosion and dilation operations to remove the skull from the brain. Brummer et al. propose an automatic skull stripping method for MRI that combines histogram-based thresholding with morphological operations [18]. In similar work [19], the authors apply a 2D Marr-Hildreth operator for edge detection and then employ several morphological operations for skull stripping. However, it is difficult to find the optimal morphological pipeline, these methods are sensitive to small data variations, and proper thresholding and edge detection remain challenging. Intensity-based methods separate brain from non-brain according to image intensity. A typical technique is the watershed algorithm, which extracts foreground and background and then uses markers to run the watershed and detect exact boundaries. Hahn et al. utilize the watershed algorithm to remove the skull from T1-weighted MR images [20]; there are similar works, such as [21,22]. These methods depend on the correctness of the intensity distribution modeling and are sensitive to intensity bias. Deformable surface-based methods evolve and deform an active contour to fit the brain surface. A popular tool, the Brain Extraction Tool (BET), employs a deformable model to separate brain and non-brain in MRI [23]. BET2 extends BET and generates better results from a pair of T1- and T2-weighted MRIs [24]. Other works, such as [25,26], also use deformable surface-based methods for skull stripping. However, these methods rely on the location of the initial curve and on the image gradient [10].
Atlas-based methods transfer knowledge of the anatomical structure of a template to separate skull and brain, e.g., [27,28]. However, these methods rely heavily on the quality of image registration and, moreover, are not applicable to cases with brain tumors or other diseases.
In recent years, thanks to computer hardware development and big data availability, deep learning has become prevalent in many domains, such as image analysis [29,30], natural language processing (NLP) [31], computer vision [32], and speech recognition [33]. Deep learning-based methods are also applied to medical image analysis, including brain segmentation [34], brain tumor classification [35], brain tumor segmentation [7], and lung cancer segmentation [36]. Deep learning has likewise been applied to skull stripping, e.g., [37][38][39]. However, these works may only be applicable to normal healthy brains or to pre-operative brains with gliomas. Therefore, to overcome the limitations mentioned above, we propose a 3D convolutional neural network (3DCNN)-based end-to-end method for generative skull stripping. It works not only for healthy brain MRIs but also for pre-/post-operative brain MRIs with GBM, and it is applicable to multicenter data.

III. Results
In this section, we first report the overall performance of skull stripping using the proposed method, then investigate the performance differences across conditions (healthy brain MRIs, pre-operative brain MRIs, and post-operative brain MRIs), subsequently assess the model robustness across multicenter data, and finally compare with the state of the art.
a. Overall Performance of Skull Stripping
As the combination of all image sequences provides the best performance, we employ that model on the testing data in the testing phase. On the 216 testing cases, our algorithm achieves an average Dice of 0.9851 ± 0.017. The complete evaluation metrics are shown in Table 1.
b. Generality of the Model
As discussed earlier, the proposed method works not only on healthy brain MRIs but also on pre-/post-operative MRIs.
To quantitatively evaluate the performance differences, we set up an experiment; the result is shown in Table 2. Interestingly, the best results occur on pre-operative brain tumor MRIs rather than on healthy brain MRIs. The reason may be that the model's training data come from pre-operative mpMRIs with glioblastoma. Overall, the skull stripping performance is stable across all conditions, whether healthy brain MRIs or brain tumor MRIs. Three showcases are shown in Figure 1.
c. Model Robustness across Multicenter Data
It is common for brain MRIs to be acquired at multiple centers/institutes, using different acquisition machines or following different protocols. This multicenter issue may undermine the performance of a model trained on single-center data. In this work, we also investigate the model robustness across centers. In addition to our in-house UPMC data (177 cases), we randomly select 39 cases (20 pre-operative and 19 post-operative) from TCIA, which collects MRI datasets from multiple institutes/hospitals. The experimental result is summarized in Table 3.
The comparison indicates that the performance on TCIA data is around 2% lower than on data obtained from the same center used for model training. Nevertheless, the skull stripping performance across centers is good enough for subsequent medical image analysis.
d. Comparison with the State of the Art
In this work, we also compare the skull stripping performance of the proposed deep learning-based method against popular methods/tools. To do so, we either re-implement the algorithm or directly use the published tool. The popular methods/tools include the Brain Extraction Tool (BET) [23], 3d skull stripping (3dSS) [42], Robust Learning-Based Brain Extraction (ROBEX) [43], UNet 3D (UNet3D) [37], and DeepMedic by UPNN [38]. The first three tools use traditional machine learning-based methods, and the last two use deep learning-based methods. An example case showing contours overlaid on the multiparametric sequences is shown in Figure 2. The performance comparison is shown in Figure 3 and Table 4; it demonstrates that the proposed method offers the best results in terms of Dice, precision, recall, FPR, FNR, and HD95. The small standard deviations indicate the robustness of the skull stripping performance.

IV. Discussion
Even though there are extensive works on skull stripping in the literature [16,24,[37][38][39], to the best of our knowledge, none of them provides an explicit quantitative analysis of performance across different image sequence combinations. It is known that different image sequences provide different brain information; therefore, multiparametric MRIs are widely used in radiomics brain research, including brain segmentation and brain tumor segmentation. In this work, we are the first group to quantitatively show the performance differences across image sequence combinations.
In the training phase, we randomly take 480 cases as the training dataset and 119 cases as the validation dataset, with the hyper-parameter settings discussed in Section VI. The Dice and loss curves for the training and validation phases are plotted in Figure 6 and Figure 4. According to the results, the combination of all four image sequences offers the best Dice (0.9869 at epoch 300 in the validation phase) and the smallest loss (0.0178 at epoch 300 in the validation phase). Our model is reliable and performs consistently in both the training and validation phases. In addition, we apply the models trained with different modality combinations to quantitatively compare skull stripping performance in the testing phase; the result is shown in Figure 5. The comparison supports the conclusion: the convolutional neural network-based model integrating all image modalities offers the best results.

V. The Proposed Method
Deep neural networks have become successful in many domains and achieve state-of-the-art performance in many applications. Therefore, in this work, we build a deep neural network-based method for skull stripping. The motivation for building a novel skull stripping method has three facets. The first is to process multiparametric brain MRI (mpMRI), which includes T1-weighted (T1), T1-weighted contrast-enhanced (T1ce), T2-weighted (T2), and T2-fluid-attenuated inversion recovery (T2-FLAIR) sequences; mpMRI offers a better skull stripping result than any single image sequence. Moreover, the method is easy to customize for any image sequence combination. Last, the proposed method generalizes to all conditions, including healthy brain MRIs and pre-/post-operative brain MRIs.
The whole workflow of brain extraction is shown in Figure 7. First, we convert the raw Digital Imaging and Communications in Medicine (.dicom) multiparametric images into the compressed Neuroimaging Informatics Technology Initiative (.nii.gz) format, then reorient the images to match the SRI24 atlas [40]. There are then two optional preprocessing steps: noise reduction and bias correction. Subsequently, each imaging modality is registered to the atlas, so that all image modalities are aligned in the same space. Finally, the co-registered images are fed into the proposed deep neural network model, which outputs a binary brain mask. Brain extraction is accomplished by multiplying the binary mask with the co-registered images.
The proposed deep neural network architecture is illustrated in Figure 8. The network has two main parts. The encoder part extracts high-dimensional features and consists of several convolution blocks and max-pooling blocks; a convolution block is composed of convolution with a residual connection, group normalization, and a leaky rectified linear unit. The other part is a decoder, which performs the opposite function: it expands the high-dimensional features to the target segmentation and consists of convolution blocks and up-sampling blocks. In addition, we design an extra block (the convolution block in green), whose feature maps are added to the features from the regular decoder to improve training convergence. We name the proposed architecture the ensemble neural network (EnNet).
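The final masking step of the workflow above can be sketched in a few lines of numpy. This is a minimal illustration under stated assumptions, not the authors' implementation; `apply_brain_mask` and the toy arrays are hypothetical.

```python
import numpy as np

def apply_brain_mask(modalities, mask):
    """Zero out non-brain voxels in each co-registered modality.

    modalities: dict mapping sequence name (e.g. "T1") to a 3D array;
    all volumes are assumed already registered to the same atlas space.
    mask: binary 3D array of the same shape (1 = brain, 0 = non-brain),
    e.g. the output of the segmentation network.
    """
    return {name: vol * mask for name, vol in modalities.items()}

# Toy example: two tiny "modalities" and a mask keeping one voxel.
vols = {"T1": np.ones((2, 2, 2)), "T2": np.full((2, 2, 2), 5.0)}
mask = np.zeros((2, 2, 2))
mask[0, 0, 0] = 1
stripped = apply_brain_mask(vols, mask)
```

Multiplying every co-registered modality by the same binary mask keeps the sequences spatially consistent after brain extraction.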

VI. Materials and Experiment
All experiments in this study are performed in accordance with relevant guidelines and regulations as approved by the institutional IRB committee at the University of Pittsburgh.
a. Dataset
In this work, we use a total of 815 multicenter cases for the experiment. Each case has mpMRIs containing T1-weighted (T1), T1-weighted contrast-enhanced (T1ce), T2-weighted (T2), and T2-fluid-attenuated inversion recovery (T2-FLAIR) sequences. Of the 815 cases, 776 are obtained from the University of Pittsburgh Medical Center (UPMC), and the remaining 39 come from The Cancer Imaging Archive (TCIA), which collects data from multiple institutes. The image size varies among cases, from 256 × 256 × 23 to 512 × 512 × 89, where 23 and 89 are the numbers of slices. The SRI24 atlas has a size of 240 × 240 × 155.
b. Experiment Setup
Before skull stripping, there are several pre-processing steps, including image format conversion, orientation change, noise reduction, bias correction, and co-registration, as detailed in Section V. In the experiment, all cases are split into training (480 cases), validation (119 cases), and testing (216 cases) datasets. The testing dataset contains 177 cases from UPMC and 39 cases from TCIA. More specifically, the 177 UPMC cases consist of 57 normal-brain, 57 pre-operative, and 63 post-operative cases, and the 39 TCIA cases are composed of 20 pre-operative and 19 post-operative MRIs. Note that the training and validation data are obtained from our in-house UPMC collection, while the testing cases come from both UPMC and TCIA to evaluate the generality of the proposed method.
c. Hyper-parameter Setting
In each iteration, we randomly crop all co-registered MRIs to a size of 160 × 192 × 128 because of the limited capacity of the graphics processing unit (GPU). We believe that the cropped image covers most of the region of interest (ROI).
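The random cropping step above can be sketched as follows. This is a minimal numpy sketch using the stated patch size; `random_crop` is a hypothetical helper, not the authors' code.

```python
import numpy as np

def random_crop(volume, crop_shape, rng=None):
    """Randomly crop the last three (spatial) axes of a volume.

    volume: array whose trailing axes are (D, H, W); a leading channel
    axis (e.g. four co-registered modalities) is preserved.
    crop_shape: target spatial size, e.g. (160, 192, 128).
    """
    if rng is None:
        rng = np.random.default_rng()
    d, h, w = volume.shape[-3:]
    cd, ch, cw = crop_shape
    assert cd <= d and ch <= h and cw <= w, "crop larger than volume"
    z = rng.integers(0, d - cd + 1)
    y = rng.integers(0, h - ch + 1)
    x = rng.integers(0, w - cw + 1)
    return volume[..., z:z + cd, y:y + ch, x:x + cw]

# Four co-registered modalities in atlas space (240 x 240 x 155).
mp = np.zeros((4, 240, 240, 155), dtype=np.float32)
patch = random_crop(mp, (160, 192, 128))
```

Cropping all modalities with the same offsets keeps the channels aligned, which matters when the network consumes them as a single multi-channel input.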
The batch size is set to 1 due to the large patch size and limited GPU memory. The loss function is computed from the class prediction p and the ground truth (GT) g.
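As an illustration of a loss comparing a class prediction with its ground truth, a commonly used soft Dice loss for binary segmentation can be written as follows. This is an assumption for illustration; the exact loss used in the paper may differ.

```python
import numpy as np

def soft_dice_loss(pred, gt, eps=1e-8):
    """Soft Dice loss for binary segmentation (an assumed stand-in;
    the paper's exact loss formula may differ).

    pred: predicted brain probabilities in [0, 1].
    gt: binary ground-truth brain mask.
    """
    inter = np.sum(pred * gt)
    return 1.0 - (2.0 * inter + eps) / (np.sum(pred) + np.sum(gt) + eps)

gt = np.array([0.0, 1.0, 1.0, 0.0])
loss_perfect = soft_dice_loss(gt, gt)  # near 0 for a perfect match
loss_partial = soft_dice_loss(np.array([0.0, 1.0, 0.0, 0.0]), gt)
```

A Dice-style loss is a natural fit here because the reported validation loss (0.0178) tracks one minus the reported validation Dice (0.9869) closely.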
We use the Adam optimizer [41] with an initial learning rate of α₀ = 0.001 in the training phase, and the learning rate α_e is gradually decayed as a function of the epoch counter e and the total number of training epochs N.
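A polynomial decay is a common choice for this kind of epoch-based schedule; the sketch below assumes that form, and the exponent `power=0.9` is an assumption rather than a value taken from this paper.

```python
def decayed_lr(lr0, epoch, total_epochs, power=0.9):
    """Polynomial learning-rate decay from lr0 down to zero.

    The exponent power=0.9 is an assumption for illustration; the
    paper only states that the rate decays with the epoch counter.
    """
    return lr0 * (1.0 - epoch / total_epochs) ** power

# Rate starts at lr0 and decays monotonically to zero at the last epoch.
schedule = [decayed_lr(0.001, e, 300) for e in (0, 100, 200, 300)]
```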

d. Evaluation Measurements
To quantitatively evaluate the performance of the proposed method, we employ several evaluation metrics, namely Dice, precision, recall, false positive rate (FPR), false negative rate (FNR), and the Hausdorff distance at the 95th percentile (HD95). They are calculated as follows:
Dice = 2TP / (2TP + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
FPR = FP / (FP + TN)
FNR = FN / (FN + TP)
HD(A, B) = max{ sup_{a∈A} inf_{b∈B} d(a, b), sup_{b∈B} inf_{a∈A} d(a, b) }
HD95 = the 95th percentile of the point-wise distances in HD(A, B)
where TP, FN, FP, TN are true positives, false negatives, false positives, and true negatives, respectively. Dice is a statistic that measures the similarity of the prediction and the ground truth: a value of 1 means the two sets are identical, and a value of 0 shows no overlap at all. Precision indicates how many of the positively classified voxels are relevant. Recall, also known as sensitivity, represents how well a test detects positives. The Hausdorff distance (HD) measures how far two subsets of a metric space are from each other; a smaller HD suggests greater similarity.
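The voxel-wise metrics above can be computed directly from binary masks. The following is a minimal numpy sketch; `voxel_metrics` is a hypothetical helper, and HD95 is omitted because it requires surface distances (e.g., via scipy's distance utilities).

```python
import numpy as np

def voxel_metrics(pred, gt):
    """Overlap metrics from binary prediction/ground-truth masks,
    following the TP/FP/FN/TN definitions above. HD95 is omitted
    since it requires spatial (surface) distances, not just counts.
    """
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)
    fp = np.sum(pred & ~gt)
    fn = np.sum(~pred & gt)
    tn = np.sum(~pred & ~gt)
    return {
        "dice": 2 * tp / (2 * tp + fp + fn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "fpr": fp / (fp + tn),
        "fnr": fn / (fn + tp),
    }

# One true positive, one false positive, one false negative, one true negative.
m = voxel_metrics(np.array([1, 1, 0, 0]), np.array([1, 0, 1, 0]))
```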

VII. Conclusion
In this work, we propose a 3D convolutional neural network-based method to extract the brain. It is a fully automatic computer-aided method that works for healthy brain MRIs as well as for pre-/post-operative brain MRIs with tumors. Moreover, the trained model is robust: it is applicable not only to in-house private data but also to multicenter data. Compared with the state of the art, the proposed method provides the best results. In addition, we are the first to quantitatively evaluate the impact on skull stripping of different MRI sequences and their combinations, and we conclude that integrating all multiparametric MRI sequences offers the highest brain extraction accuracy. In the future, we would like to train the deep learning model with more cases and apply it to more multicenter data.

DATA AVAILABILITY
Partial datasets generated and/or analyzed during the current study are available in The Cancer Imaging Archive (TCIA) repository (link: https://www.cancerimagingarchive.net). The remaining data are privately owned by the University of Pittsburgh Medical Center (UPMC).

Author Contributions
LP designed and constructed the experiments and wrote the draft of the manuscript. MA, TN, ZS, KS, KA, and YM verified the ground truth of the experimental dataset and revised the manuscript. CL and EM revised the manuscript. CR supervised the whole project and revised the manuscript.