A. Dataset
In this study, we used two datasets. The data for training the segmentation model were obtained from the 2019 Medical Segmentation Decathlon Challenge and consist of scans from 105 patients and 90 healthy subjects. The 3D structural MRI data were acquired with a 3D T1-weighted Magnetization Prepared Rapid Gradient Echo Imaging (MPRAGE) sequence (TI/TR/TE, 860/8.0/3.7 ms; 170 sagittal slices; voxel size, 1.0 mm³), all on the same machine. Tracing of the head, body, and tail of the hippocampus has been performed on the entire dataset [39]. The data used to test the pipeline end to end were acquired from Mount Sinai Medical Center (MSMC), Miami, Florida, as part of the data for the 1Florida Alzheimer’s Disease Research Center (ADRC). We first apply an MR-based skull-stripping technique to extract the brain from each MRI scan. The hippocampus is then segmented in each brain image separately. The volumetric results from this second dataset are compared with those obtained using FreeSurfer 6.0.
B. Evaluation Metrics
To quantitatively evaluate and compare the performance of the proposed method, four standard metrics were used. The mean Dice Similarity Coefficient (DSC) measures the overlap between the ground truth mask \({A}_{g}\) and the predicted mask \({A}_{p}\).
$$DSC=\frac{1}{n}\sum _{i=1}^{n}\frac{2\left|{A}_{{p}_{i}}\cap {A}_{{g}_{i}}\right|}{\left|{A}_{{p}_{i}}\right|+\left|{A}_{{g}_{i}}\right|} \left(1\right)$$
The mean Jaccard Similarity Coefficient (JSC) is used to compare the similarity between \({A}_{g}\) and \({A}_{p}\).
$$JSC=\frac{1}{n}\sum _{i=1}^{n}\frac{\left|{A}_{{p}_{i}}\cap {A}_{{g}_{i}}\right|}{\left|{A}_{{p}_{i}}\cup {A}_{{g}_{i}}\right|} \left(2\right)$$
The Precision Index is the fraction of the predicted mask \({A}_{p}\) that overlaps the ground truth mask \({A}_{g}\), while the Recall Index is the fraction of the ground truth mask \({A}_{g}\) that is recovered by the prediction.
$$Precision\hspace{0.25em}Index=\frac{1}{n}\sum _{i=1}^{n}\frac{\left|{A}_{{p}_{i}}\cap {A}_{{g}_{i}}\right|}{\left|{A}_{{p}_{i}}\right|} \left(3\right)$$
$$Recall\hspace{0.25em}Index=\frac{1}{n}\sum _{i=1}^{n}\frac{\left|{A}_{{p}_{i}}\cap {A}_{{g}_{i}}\right|}{\left|{A}_{{g}_{i}}\right|} \left(4\right)$$
All these metrics are calculated per sample, and their means over the test dataset are reported. A good segmentation method should produce a high value for all of them.
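As a concrete illustration, the per-sample versions of these four overlap metrics can be computed directly from binary masks with NumPy. This is an illustrative sketch using the standard definitions (the function name is ours, not from the paper):

```python
import numpy as np

def segmentation_metrics(pred, gt):
    """Per-sample overlap metrics for boolean masks `pred` (prediction)
    and `gt` (ground truth) of identical shape. Averaging the returned
    values over all test samples gives the reported mean metrics."""
    inter = np.logical_and(pred, gt).sum()       # |pred ∩ gt|
    union = np.logical_or(pred, gt).sum()        # |pred ∪ gt|
    dsc = 2.0 * inter / (pred.sum() + gt.sum())  # Dice similarity
    jsc = inter / union                          # Jaccard similarity
    precision = inter / pred.sum()               # overlap over prediction
    recall = inter / gt.sum()                    # overlap over ground truth
    return dsc, jsc, precision, recall
```

Note that the Dice and Jaccard scores are monotonically related (DSC = 2·JSC / (1 + JSC)), so they rank methods identically, but both are commonly reported.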
C. Loss function
Class imbalance between object and background remains a major issue in segmentation tasks, especially in neuroimaging. Since a small ROI is usually suppressed through max-pooling layers, solutions based on optimizing the cross-entropy loss function are often unsatisfactory. To overcome this issue, besides the localization step, a mixed focal Dice loss function has been adopted for model training. Focal loss was first introduced to address the class imbalance faced by cross-entropy loss: it down-weights the contribution of easy examples, which in turn enables learning from harder examples. Since our segmentation problem is class-imbalanced, we adopt a weighted combination of a modified focal loss \({L}_{mF}\) and a modified focal Dice loss \({L}_{mFD}\), as below:
$${L}_{MF}=\lambda {L}_{mF}+\left(1-\lambda \right){L}_{mFD} \left(5\right)$$
$${L}_{mFD}=\sum _{c=1}^{C}{\left(1-mD\right)}^{\frac{1}{\gamma }} \left(6\right)$$
where \(\lambda \in \left[0,1\right]\) defines the relative weights of the two components of the loss function, \(\gamma\) is the focal parameter, \(C\) is the number of classes, and \(mD\) is the modified Dice score computed for each class \(c\). The \({L}_{mF}\) term in (5) is defined as in (7):
$${L}_{mF}=\alpha {\left(1-{p}_{t}\right)}^{\gamma }\cdot {L}_{mCE} \left(7\right)$$
The \({L}_{mCE}\) term is computed using Eq. (8):
$${L}_{mCE}=-\frac{1}{N}\sum _{i=1}^{N}\left[\beta {t}_{i}\mathrm{log}\left({p}_{i}\right)+\left(1-\beta \right)\left(1-{t}_{i}\right)\mathrm{log}\left(1-{p}_{i}\right)\right] \left(8\right)$$
$${p}_{t}=\left\{\begin{array}{ll}p& \text{if y = 1}\\ 1-p& \text{if y = 0}\end{array}\right.$$
The term \({t}_{i}\) refers to the Tversky index, an asymmetric similarity measure closely related to the Dice score that enables optimization for output imbalance by tuning the weights assigned to false positives and false negatives. Details on the calculation of \({t}_{i}\) are provided in [40]. The \(\alpha\) term, in the range [0, 1], controls the relative contribution of the Dice and cross-entropy terms to the loss, and \(\beta\) controls the relative weights assigned to false positives and false negatives. A value of \(\beta >\frac{1}{2}\) penalizes false negative predictions more than false positives.
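To make the structure of Eqs. (5)–(8) concrete, the following sketch evaluates the combined loss on flattened probability maps. It is illustrative only: the "modified" terms of [40] are simplified here to their standard forms (plain class-balanced cross-entropy, a single foreground class, and the soft Dice score in place of \(mD\)), and the function name and defaults are our assumptions:

```python
import numpy as np

def focal_dice_loss(p, t, lam=0.5, gamma=2.0, alpha=0.25, beta=0.7, eps=1e-7):
    """Sketch of the combined loss L_MF in Eq. (5) for one sample.
    `p`: predicted foreground probabilities, `t`: binary targets
    (both flat numpy arrays). Simplified relative to [40]."""
    p = np.clip(p, eps, 1 - eps)
    # Class-balanced cross entropy, cf. Eq. (8); beta > 0.5 penalizes
    # false negatives (missed foreground) more heavily.
    l_ce = -np.mean(beta * t * np.log(p)
                    + (1 - beta) * (1 - t) * np.log(1 - p))
    # Focal modulation, cf. Eq. (7): down-weight easy examples
    # (those with high probability of the true class).
    p_t = np.mean(t * p + (1 - t) * (1 - p))
    l_focal = alpha * (1 - p_t) ** gamma * l_ce
    # Focal Dice term, cf. Eq. (6), single foreground class (C = 1).
    dice = 2 * np.sum(p * t) / (np.sum(p) + np.sum(t) + eps)
    l_fd = (1 - dice) ** (1 / gamma)
    # Weighted combination, Eq. (5).
    return lam * l_focal + (1 - lam) * l_fd
```

With this form, a near-perfect prediction yields a small positive loss, while a poor prediction is penalized through both the focal cross-entropy and the Dice terms.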
D. Network Architecture Design
In this section, we introduce a novel framework for hippocampus segmentation. This framework is composed of two modules: 1) hippocampus localization, and 2) hippocampus segmentation. In the first module, a heuristic model estimates the hippocampus location in the brain and produces a cropped area surrounding the hippocampal tissue. In the second module, the cropped area is passed through a segmentation model.
The design of the first module is inspired by [30], [31], [37]. It is a heuristic algorithm that first performs 3D skull stripping to extract the brain volume. Based on the ratio of the acquired brain volume to the average volume over the training set, it estimates the relative distance from the first slice of the brain to the first slice in which the hippocampus appears, in each of the coronal, sagittal, and axial views, to obtain a rough estimate of the hippocampus location in the 3D MRI.
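A minimal sketch of this relative-distance heuristic is given below, assuming a simple linear scaling of reference slice offsets by the cube root of the volume ratio. All names, the scaling rule, and the per-axis crop parameters are our assumptions for illustration, not the paper's exact procedure:

```python
import numpy as np

def estimate_hippocampus_crop(brain, ref_volume, ref_offsets, crop_size):
    """Hypothetical sketch of the localization heuristic.
    `brain`: binary 3D brain mask after skull stripping.
    `ref_volume`: average brain volume (voxels) over the training set.
    `ref_offsets`: reference slice offsets, per axis, from the first
    brain slice to the first hippocampus slice (training-set average).
    `crop_size`: size of the cropped region along each axis."""
    # Linear scale factor from the volume ratio (volume ~ length^3).
    scale = (brain.sum() / ref_volume) ** (1.0 / 3.0)
    crop = []
    for axis, (off, size) in enumerate(zip(ref_offsets, crop_size)):
        # First slice along this axis that contains brain tissue.
        profile = brain.any(axis=tuple(a for a in range(3) if a != axis))
        first = int(np.argmax(profile))
        start = first + int(round(off * scale))  # scaled relative distance
        crop.append((start, start + size))
    return crop  # [(start, stop), ...] per axis, bounding the hippocampus
```

The returned per-axis ranges define the cropped subvolume that is passed to the segmentation module.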
In the second module, the cropped area surrounding the hippocampus is fed into a transformer-based segmentation model. The segmentation algorithm is inspired by the architecture design of TransUnet proposed by Chen et al. [41]. This design integrates the Vision Transformer with the UNet model and has shown promising segmentation results on abdominal CT scans, but it had not been explored for 3D brain MRI segmentation. The latter is an even more challenging problem given the difficulty of delineating different regions of the brain in 3D MRI.
The hippocampus segmentation problem falls into the category of imbalanced segmentation, as the proportion of the region of interest is much smaller than that of the background. To improve the original implementation of the model for this imbalanced case, a combined loss function based on the focal loss approach has been adopted. Data augmentation (random rotation and flipping) is applied before feeding the data points into the segmentation model. The proposed pipeline is depicted in Fig. 1.
The UNet architecture, proposed in 2015 by Ronneberger et al. [42], has been one of the most dominant methods in medical image segmentation and has since served as a building block of many other segmentation models. In contrast to object detection, which draws a bounding box around the subject and assigns it a label, image segmentation produces a fine binary map over the image that classifies each pixel, separating the background from the object or region of interest. The UNet architecture is composed of two main paths: an encoding path and a decoding path.
The encoding path is made up of multiple convolution layers, each followed by a max-pooling layer. Through this path, the model learns spatially relevant contextual information. A reverse decoding path adds precise localization to yield a final segmentation of the same size as the input image.
To improve the UNet architecture, in 2018 Zhou et al. redesigned the encoder and decoder paths and the skip connections of the original UNet to introduce UNet++. The pathways in UNet++ are composed of a series of nested dense connections that reduce the semantic gap between the encoder's and decoder's feature maps. These strengthened connections in UNet++ have shown considerable improvement in segmentation tasks [43].
The attention mechanism was first proposed for natural language processing tasks and has more recently expanded to image processing and computer vision. It draws from human vision: once we know the context in which an object appears in a scene, we look for the same context when searching for that object in the future. Multiple research efforts have improved their designs by adding an attention mechanism in conjunction with convolution layers; one of them is MANet. While many of the proposed UNet variants are based on multi-scale feature fusion, MANet suggests a new attention-based model.
While several studies have exploited the attention mechanism for image classification, a fully transformer-based model, the Vision Transformer, was proposed by the Google Research team in late 2020. Its architecture closely follows the original transformer model proposed for Natural Language Processing (NLP): it processes a sequence of image patches like NLP tokens for image classification tasks. The Vision Transformer has shown promising results compared to state-of-the-art CNN models when trained on a large dataset for long enough, while requiring substantially fewer computational resources to train [44].
This transformer-based architecture was designed for image classification using an encoding module: for each processed image, the model predicts a label. To make it applicable to more complex tasks such as object detection and image segmentation, some architectural modifications are essential. To apply the model to a segmentation task, Chen et al. coupled the transformer-based encoder with a decoding module inspired by UNet. They also applied the transformer encoding to the feature maps extracted from the third layer of a ResNet50 network, a design they selected after failing to obtain compelling results with the original architecture, which directly tokenized the input image.
This CNN-Transformer hybrid design performs better than pure transformer encoding, as it allows the network to exploit high-resolution CNN feature maps in the reconstruction path. The reconstruction path consists of several up-sampling units. The first reconstruction unit takes the output of the transformer encoder, up-samples it, and concatenates the resulting feature map with the feature map of the last CNN layer of the ResNet in the corresponding encoding path, incorporating multi-scale information into the model. The outcome passes through a 3×3 convolutional layer and a Rectified Linear Unit (ReLU) to form the input for the next reconstruction unit. The same process is applied twice more to the output of each layer until the decoder reconstructs the segmentation map at the original input resolution.
A segmentation head is added at the end of the reconstruction path, which classifies each pixel into its corresponding class and recovers the segmentation mask at the same resolution as the input image. The network architecture adopted from TransUnet is depicted in Fig. 2. To ensure better performance on an imbalanced segmentation task, we have changed the original loss function to a combined focal loss inspired by [40].
In our design, unlike in the original Vision Transformer model, the image is passed through a CNN model to generate a rich feature map. Furthermore, the first few intermediate feature maps in the ResNet module are also kept, helping reconstruction in the up-sampling path. The final feature map, which has a 2D shape, is split into fixed-size 1×1 patches. The patches are flattened and linearly projected to a new latent space. To retain positional information, a position embedding is added to each patch separately as input to the transformer encoder unit. Each transformer layer consists of a layer norm and a Multi-Head Attention (MHA) unit; in this model, 12 transformer units are stacked on top of each other. The final feature map is bi-linearly up-sampled and concatenated with the corresponding feature map in the down-sampling path. Each up-sampling block consists of a 2× up-sampling operator, a 3×3 convolution layer, and a ReLU function.
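The tokenisation step above can be sketched as a shape walk-through. The concrete sizes here are assumptions for illustration (a 256×256 input reduced by the ResNet to a 16×16 feature map with 1024 channels, and a hidden size of 768 as in the base Vision Transformer); random arrays stand in for learned weights:

```python
import numpy as np

rng = np.random.default_rng(0)

feat = rng.standard_normal((16, 16, 1024))   # final CNN feature map (H', W', C)
tokens = feat.reshape(-1, 1024)              # 1x1 patches -> 256 token vectors
W = rng.standard_normal((1024, 768)) * 0.02  # linear patch projection (learned)
pos = rng.standard_normal((256, 768)) * 0.02 # position embedding (learned)
x = tokens @ W + pos                         # sequence fed to the 12 stacked
                                             # transformer encoder units
```

Because the patches are 1×1 on the already-downsampled feature map, the sequence length stays modest (256 tokens here), which keeps the self-attention cost manageable.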
Considering the enormous success of feature-extractor networks (ResNet, SENet, EfficientNet, Xception) in many segmentation tasks, we integrated them into our pipeline and studied them extensively when paired with three distinct segmentation models (UNet, UNet++, MANet). We evaluated their performance on highly imbalanced segmentation tasks with specific challenges such as convoluted structures like the hippocampus and low-contrast margins between different brain regions.