Dataset
This nonrandomized retrospective study was approved by the Ethical Committee for Epidemiology of Hiroshima University (Approval Number: E-2119). All methods were performed in accordance with the Ethical Guidelines for Medical and Human Research Involving Human Subjects, Japan. Because of the retrospective design, the Ethical Committee for Epidemiology of Hiroshima University waived the requirement for informed consent, and an opt-out method was used instead. The study included MR images of 10 patients with anterior disc displacement aged between 19 and 39 years (mean age, 26.4 years; 8 women, 2 men) and 10 healthy control subjects aged between 18 and 41 years (mean age, 27 years; 8 women, 2 men), all with available medical records. Each subject underwent MR imaging on an Ingenia 3.0-T CX Quasar Dual scanner (Philips Healthcare, Best, the Netherlands). Only proton density-weighted sagittal images were used. In total, 217 such images were included, covering the left and right TMJ regions in closed- and open-mouth positions: 106 images from the 10 patients and 111 images from the 10 control subjects.
Two expert orthodontists (12 and 6 years of experience) and one expert oral and maxillofacial radiologist (25 years of experience) independently identified and manually segmented all articular discs of the TMJ on the MR images using ImageJ software (version 1.53, National Institutes of Health, Bethesda, MD; Fig. 1). The manually segmented MR images were split into a training set (80%) and a test set (20%) for each of the following experiments. For the dataset of normally positioned articular discs, the 111 images from the 10 control subjects were randomly split into 88 training images and 23 test images. For the dataset of displaced articular discs, the 106 images from the 10 patients were randomly split into 84 training images and 22 test images. For the dataset mixing normally positioned and displaced articular discs, the 217 images were randomly split into 173 training images and 44 test images.
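The reported split sizes are consistent with an 80% training share rounded down, with the remaining images assigned to the test set. The following is a minimal sketch of such a split and is not the code used in the study; the function name `split_indices`, the fixed seed, and the use of Python's `random` module are assumptions made for illustration.

```python
import random


def split_indices(n_images, train_percent=80, seed=0):
    """Randomly split image indices into training and test index lists.

    The training share is rounded down. The seed is an arbitrary choice
    that only makes the shuffle repeatable.
    """
    indices = list(range(n_images))
    random.Random(seed).shuffle(indices)
    n_train = n_images * train_percent // 100  # integer floor of the share
    return indices[:n_train], indices[n_train:]


# Dataset sizes taken from the text: control, patient, and mixed images.
for name, n in [("normal", 111), ("displaced", 106), ("mixed", 217)]:
    train, test = split_indices(n)
    print(name, len(train), len(test))
```

For 111, 106, and 217 images this yields 88/23, 84/22, and 173/44 training/test images, matching the counts given above.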
Deep learning algorithms
All procedures were performed on a computer with an Intel Core i7-9750H 2.60-GHz CPU (Intel, Santa Clara, CA), 16.0 GB RAM, and an NVIDIA GeForce RTX 2070 MAX-Q 8.0-GB graphics processing unit (NVIDIA, Santa Clara, CA). Deep learning algorithms were written in Python [17] and implemented using the Keras framework for deep learning with TensorFlow as the backend.
We adapted three convolutional semantic segmentation approaches: an encoder-decoder CNN, U-Net [15], and SegNet [16], all of which are well suited to segmentation tasks. The overall architectures are shown in Fig. 2. In this study, we propose an encoder-decoder CNN model, which we named 3DiscNet (Detection for Displaced articular DISC using convolutional neural NETwork), with an asymmetric encoder-decoder architecture for extracting features at different spatial fields of view (Fig. 2A). To reduce overfitting, dropout layers were placed after the convolutional and max-pooling layers [18]. All dropout rates were set to 0.3. The final layer uses a sigmoid activation function that classifies each pixel as articular disc or background. The U-Net is a fully convolutional network consisting of convolution and max-pooling layers in the encoder, and convolution and transposed convolution layers in the decoder. Encoder outputs are concatenated to the decoding layers to share spatial cues and to propagate the loss efficiently. The SegNet uses a classical architecture for semantic pixel-wise segmentation: the max-pooling indices recorded in the encoder layers are used to upsample the feature maps, which are then convolved in a trainable decoder network. The original architectures of U-Net and SegNet are illustrated in Fig. 2B and C. The type of SegNet architecture used is currently termed SegNet-Basic [19]. As in 3DiscNet, the final layers of U-Net and SegNet-Basic employ a sigmoid classifier instead of their original soft-max classifier. The U-Net and SegNet have shown promise for semantic segmentation of organs and pathology in MR images [20-22].
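The index-based upsampling that distinguishes SegNet from U-Net can be illustrated on a single feature map. The sketch below is a NumPy illustration of that one step only, not the Keras implementation used in the study; the function names `max_pool_with_indices` and `max_unpool` are hypothetical, and a real SegNet decoder follows the unpooling with trainable convolutions.

```python
import numpy as np


def max_pool_with_indices(x, k=2):
    """k x k max pooling (stride k) on a 2-D map; also returns argmax positions."""
    h, w = x.shape
    assert h % k == 0 and w % k == 0, "sketch assumes dimensions divisible by k"
    # Rearrange so that each row of the last axis holds one pooling window.
    windows = (x.reshape(h // k, k, w // k, k)
                .transpose(0, 2, 1, 3)
                .reshape(h // k, w // k, k * k))
    idx = windows.argmax(axis=-1)  # position of the maximum inside each window
    pooled = np.take_along_axis(windows, idx[..., None], axis=-1)[..., 0]
    return pooled, idx


def max_unpool(pooled, idx, k=2):
    """Put each pooled value back at its recorded position; zeros elsewhere."""
    ph, pw = pooled.shape
    windows = np.zeros((ph, pw, k * k), dtype=pooled.dtype)
    np.put_along_axis(windows, idx[..., None], pooled[..., None], axis=-1)
    # Undo the window rearrangement to recover the original spatial layout.
    return (windows.reshape(ph, pw, k, k)
                   .transpose(0, 2, 1, 3)
                   .reshape(ph * k, pw * k))


feature_map = np.array([[1, 2, 0, 0],
                        [3, 4, 0, 5],
                        [0, 0, 6, 0],
                        [7, 0, 0, 0]], dtype=float)
pooled, idx = max_pool_with_indices(feature_map)
restored = max_unpool(pooled, idx)
print(pooled)
print(restored)
```

The unpooled map is sparse: only the positions that held each window's maximum are filled, so the decoder's convolutions are what densify it again. U-Net instead learns its upsampling and relies on the concatenated encoder outputs for spatial detail.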
First, regions of interest (ROIs) around the articular disc were extracted from the datasets. The original image resolution was 512 × 512 pixels, and the ROIs, which were defined using a 161 × 184 pixel bounding box, were automatically cropped from the images using Python algorithms. The ROI images were then resized to 224 × 256 pixels for input into the three types of convolutional encoder-decoder network (Fig. 3). The 3DiscNet was trained using the Adam optimizer with a learning rate of 1.0 × 10⁻³, and the three algorithms were trained for a total of 2000 epochs.
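The cropping and resizing steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's preprocessing code: the text does not say which of the two ROI dimensions is the row count, where the bounding box was placed, or which interpolation was used, so the (rows, columns) ordering, the `top` and `left` values, and the nearest-neighbour resize below are all hypothetical choices.

```python
import numpy as np


def crop_roi(image, top, left, roi_shape=(161, 184)):
    """Crop a fixed-size ROI whose top-left corner is at (top, left)."""
    h, w = roi_shape
    if top < 0 or left < 0 or top + h > image.shape[0] or left + w > image.shape[1]:
        raise ValueError("ROI falls outside the image")
    return image[top:top + h, left:left + w]


def resize_nearest(image, out_shape=(224, 256)):
    """Nearest-neighbour resize of a 2-D image (stand-in for a library resize)."""
    in_h, in_w = image.shape
    out_h, out_w = out_shape
    rows = np.arange(out_h) * in_h // out_h  # source row for each output row
    cols = np.arange(out_w) * in_w // out_w  # source column for each output column
    return image[rows[:, None], cols[None, :]]


mr_slice = np.random.default_rng(0).random((512, 512))  # placeholder for one MR image
roi = crop_roi(mr_slice, top=150, left=160)             # hypothetical ROI position
net_input = resize_nearest(roi)
print(roi.shape, net_input.shape)
```

Both axes are scaled by the same factor (224/161 ≈ 256/184 ≈ 1.39), so the resize preserves the aspect ratio of the ROI.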
Performance metrics
The test data were used to evaluate the accuracy and computational efficacy of the models. The performance of each convolutional encoder-decoder network was assessed on the test dataset using the Dice similarity coefficient, sensitivity, and positive predictive value (PPV). The Dice similarity coefficient, a widely used similarity metric, was calculated using the following formula:
$$\mathrm{Dice} = \frac{2\,|P \cap T|}{|P| + |T|}$$
where P is the pixel area of the articular disc segmented by the convolutional encoder-decoder network, and T is the pixel area of the manually segmented ground truth ROI. The sensitivity is the percentage of the actual articular disc area that is correctly predicted as articular disc area, defined as:
$$\mathrm{Sensitivity} = \frac{|P \cap T|}{|T|}$$
The PPV is the percentage of the predicted articular disc area that is actual articular disc area, defined as:
$$\mathrm{PPV} = \frac{|P \cap T|}{|P|}$$
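The three metrics can be computed directly from a predicted mask and a ground-truth mask. The sketch below uses the standard overlap-based definitions with P as the predicted pixels and T as the ground-truth pixels; it assumes both masks are already binary (the threshold applied to the sigmoid output is not stated in the text), and the function name and the handling of empty masks are choices made for this illustration.

```python
import numpy as np


def segmentation_metrics(pred, truth):
    """Return (Dice, sensitivity, PPV) for two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    overlap = np.logical_and(pred, truth).sum()  # |P ∩ T|
    p_area = pred.sum()                          # |P|
    t_area = truth.sum()                         # |T|
    # Empty-mask handling below is a convention chosen for this sketch.
    dice = 2 * overlap / (p_area + t_area) if (p_area + t_area) else 1.0
    sensitivity = overlap / t_area if t_area else float("nan")
    ppv = overlap / p_area if p_area else float("nan")
    return float(dice), float(sensitivity), float(ppv)


truth = [[1, 1, 0],
         [1, 1, 0],
         [0, 0, 0]]   # 4 ground-truth pixels
pred = [[0, 1, 1],
        [0, 1, 0],
        [0, 0, 0]]    # 3 predicted pixels, 2 of them inside the ground truth
dice, sensitivity, ppv = segmentation_metrics(pred, truth)
print(dice, sensitivity, ppv)
```

In this toy case the overlap is 2 pixels, giving Dice = 4/7, sensitivity = 2/4, and PPV = 2/3. Sensitivity penalizes missed disc pixels, PPV penalizes pixels wrongly labeled as disc, and Dice balances the two.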