In this section, we first introduce the coronary IVUS dataset used for training and testing. Then, the 8-layer deep U-Net architecture that predicts the lumen and EEM-CSA masks of IVUS images is presented. Finally, the training details and the metric used to evaluate the proposed method are described.
Dataset and Augmentation
We use a coronary IVUS dataset from The Second Affiliated Hospital of Zhejiang University School of Medicine. It consists of in vivo pullbacks of coronary arteries acquired with the iLab IVUS system (Boston Scientific Corporation) equipped with a 40-MHz OptiCross catheter. The dataset contains IVUS frames from 30 patients, selected at the end-diastolic cardiac phase and stored in DICOM format, with a resolution of 512×512 for each frame. It is split into two parts: 567 frames for training and 108 frames for testing. The training set is used to build the deep learning model, and the testing set is used to evaluate the model performance.
IVUS images contain the catheter, lumen, endothelium, intima, media, external elastic membrane, adventitia, and atherosclerotic plaque. The external elastic membrane is usually treated as the border between the media and the adventitia. The media appears gray or dark because it consists of dense smooth muscle, while the adventitia resembles the external tissues surrounding the vascular wall. The endothelium and intima are thinner than the lumen and media. Thus, the lumen and EEM-CSA can be manually annotated by experienced physicians as the ground truth for metric evaluation. Each IVUS frame was manually annotated for the lumen and EEM-CSA in the short-axis view by three clinical experts from the Cardiology Department who work daily with this specific IVUS brand, as shown in Figure 1. Each expert was blinded to the other two experts' annotations, and every frame was labeled independently by all three experts to ensure the correctness and blindness of the annotations.
The training set comprises 567 frames, which is not large enough for training a CNN model from scratch, so data augmentation is essential for good performance. The augmentation is twofold and performed online. First, the raw coronary IVUS images and the corresponding ground truth are randomly (1) rotated by 90°, 180°, or 270°; (2) flipped up-down or left-right. Second, a MeshGrid is added to the raw image at pixel level, providing relative location information. Because structures such as the intima and adventitia occupy relatively fixed positions in IVUS images, the MeshGrid can play a guiding role in the training process, telling the network where to look. The ground truth is left unchanged by this step.
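The twofold augmentation above can be sketched in NumPy as follows. This is a minimal illustration under our own reading of the paper: the function names are ours, and we assume the MeshGrid is appended to the raw image as normalized coordinate channels (the ground-truth mask is not modified in that step).

```python
import numpy as np

def augment(image, mask, rng=np.random.default_rng()):
    """Randomly rotate (90/180/270 degrees) and flip an IVUS frame
    together with its ground-truth mask."""
    k = int(rng.integers(0, 4))               # number of 90-degree rotations
    image, mask = np.rot90(image, k), np.rot90(mask, k)
    if rng.random() < 0.5:                    # up-down flip
        image, mask = np.flipud(image), np.flipud(mask)
    if rng.random() < 0.5:                    # left-right flip
        image, mask = np.fliplr(image), np.fliplr(mask)
    return image, mask

def add_meshgrid(image):
    """Append normalized x/y coordinate channels (the MeshGrid) to a
    2-D grayscale frame; the corresponding mask is left unchanged."""
    h, w = image.shape
    ys, xs = np.meshgrid(np.linspace(0.0, 1.0, h),
                         np.linspace(0.0, 1.0, w), indexing="ij")
    return np.stack([image, xs, ys], axis=-1)
```

Both transforms are cheap enough to run online in the input pipeline, which is consistent with the paper's setup.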
Model Architectures
The U-Net is a type of fully convolutional network (FCN) [20] and is the most common convolutional architecture for biomedical image segmentation. It consists of an encoder part and a decoder part and predicts a segmentation mask at pixel level instead of performing image-level classification. The encoder part down-samples the input and extracts higher-level features. The decoder part up-samples the encoder output and concatenates the feature maps of the corresponding encoder layer via skip-connections. The skip-connections relieve the gradient diffusion problem caused by deep layers. The final decoder layer is activated by softmax to produce the class probability map from which the segmentation prediction is recovered.
The encoder part has 9 blocks, each incorporating two repeated operations of 3×3 convolution, batch normalization (BN), and LeakyReLU activation. A down-sampling 3×3 convolution with stride 2×2 halves the feature-map size, and the 8th block has a size of 2×2 to capture deeper abstract information. The decoder part has 8 blocks to restore the image dimension. Each up-sampling operation is a 5×5 deconvolution with stride 2×2, and the skip-connection concatenates the corresponding feature maps. A final 1×1 convolution outputs the probability map of the mask class prediction via softmax activation. The entire architecture is shown in Figure 2.
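The spatial sizes implied by the encoder can be checked with simple arithmetic: assuming 'same' padding, each stride-2 3×3 convolution maps a side length n to ceil(n/2), so eight down-samplings take a 512×512 input down to the 2×2 block the text mentions. A short sketch (the function name is ours):

```python
def encoder_sizes(input_size=512, num_down=8):
    """Side length of the feature map after each stride-2 3x3 'same'
    convolution: n -> ceil(n / 2) at every down-sampling step."""
    sizes = [input_size]
    for _ in range(num_down):
        sizes.append(-(-sizes[-1] // 2))  # ceil division
    return sizes

print(encoder_sizes())  # [512, 256, 128, 64, 32, 16, 8, 4, 2]
```

The final 2×2 size agrees with the description of the 8th encoder block above.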
Implementation Details
The model was trained and evaluated on a Dell PowerEdge T640 server with a Xeon Silver 4114 processor, 128 GB of RAM, and four Nvidia GTX 1080Ti graphics cards. Training took less than 90 minutes, and inference took 10 ms per image.
The deep learning framework used in this study was TensorFlow 1.13. The optimizer was Adam [21], which is fast and robust. The weights were initialized randomly, and the batch size was set to 16. The initial learning rate was 0.001, with a decay of 0.1 every 2000 iterations; a total of 8000–10000 iterations were run for training. The lumen and EEM-CSA were trained and predicted in one shot with the softmax function as the output activation, which gives each pixel its class probability. The loss function was the sparse softmax cross entropy:
$$p_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}} \quad (1)$$

$$L = -\sum_{j=1}^{K} y_j \log p_j \quad (2)$$

with $K$ being the number of classes, $p_j$ being the predicted probability of belonging to class $j$, and $y_j$ being the true probability (1 for the labeled class and 0 otherwise).
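The loss can be sketched in NumPy as follows; this mirrors the behavior of TensorFlow's sparse softmax cross entropy on integer labels, though the function names here are ours.

```python
import numpy as np

def softmax(logits):
    """Class probabilities along the last axis, shifted for numerical
    stability before exponentiation."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def sparse_softmax_cross_entropy(logits, labels):
    """Mean of -log p_y over all pixels, where labels holds integer
    class ids (no one-hot encoding needed)."""
    p = softmax(logits).reshape(-1, logits.shape[-1])
    rows = np.arange(labels.size)
    return -np.log(p[rows, labels.ravel()]).mean()
```

Because the labels are integer class ids, the sum over classes in Equation (2) reduces to picking out the probability of the true class for each pixel.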
Evaluation Criteria
In semantic segmentation, the mean intersection over union (MIoU), also called the Jaccard measure (JM), is the standard metric for evaluating a model. We compute the MIoU score between the ground truth and the predicted masks:
$$\mathrm{MIoU} = \frac{1}{k+1}\sum_{i=0}^{k}\frac{p_{ii}}{\sum_{j=0}^{k} p_{ij} + \sum_{j=0}^{k} p_{ji} - p_{ii}} \quad (3)$$

with $k$ being the number of classes excluding the background, and $p_{ij}$ being the number of pixels of class $i$ predicted as class $j$.
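Equation (3) can be evaluated from a confusion matrix over the ground-truth and predicted masks. A minimal NumPy sketch (the function name is ours):

```python
import numpy as np

def miou(gt, pred, num_classes):
    """MIoU over integer-labeled masks. p[i, j] counts pixels of true
    class i predicted as class j; IoU per class is the diagonal over
    (row sum + column sum - diagonal)."""
    p = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(p, (gt.ravel(), pred.ravel()), 1)
    inter = np.diag(p)
    union = p.sum(axis=1) + p.sum(axis=0) - inter
    return np.mean(inter / np.maximum(union, 1))
```

A perfect prediction gives a diagonal confusion matrix and an MIoU of 1.0; any misclassified pixel enlarges the union of its class and lowers the score.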