Subjects
Patients who complained of dysphagia or were suspected of having dysphagia underwent VFSS at Korea University Guro Hospital between September 2020 and September 2021 and were consecutively recruited for this study. In total, 594 VFSS videos were retrospectively reviewed. The exclusion criteria were as follows: (1) patients younger than 19 years of age; (2) patients unable to progress from the pharyngeal phase to the esophageal phase, because full analysis of the temporal parameters was not possible in these cases; (3) incompletely recorded videos; and (4) videos whose contrast was too low to identify anatomic structures. Based on these criteria, 47 videos were excluded from the study. Consequently, 547 ground-truth data samples were collected. This study was approved by the Institutional Review Board of Korea University Guro Hospital (IRB No. 2021GR0568).
Videofluoroscopic swallowing study analysis
The VFSS was performed by a single rehabilitation physician. The participants were seated in an upright position and swallowed barium-mixed materials. The lateral view of the head and neck region was recorded at 15 frames per second (FPS) using a radio-fluoroscopy system (Sonialvision G4; Shimadzu Medical Systems and Equipment, Japan). Although materials of various amounts and viscosities were administered during the study, only videos of swallowing 2 mL of thin liquid were used for analysis. The thin liquid was mixed with barium (35% w/v). Two experienced rehabilitation physicians independently analyzed the VFSS video clips. The intraclass correlation coefficient was 0.999 (p < 0.001). Any disagreement between the two clinicians was resolved by consensus through discussion. The analysis process was conducted as follows: (1) extraction of the first sub-swallow video, (2) major event labeling, and (3) temporal parameter measurement.
Extracting the First Sub-swallow Video
If there were multiple sub-swallows, only the first sub-swallow video was extracted. We set the first frame as the starting point of the oral phase, and the last frame as the point between the swallow rest and the start of the next sub-swallow.
Major Event Labeling
As described in Fig. 1, major event labeling was performed in a manner similar to the ASPEKT method devised by Steele et al. [8]. The adjusted definitions of major events are as follows:
Start of oral phase
This event is defined as the time when the bolus first enters the oral cavity. This event is also the first frame of the video clip used for temporal analysis.
Bolus past mandible
This is the first frame in which the leading edge of the bolus touches or crosses the ramus of the mandible. If the two mandibular lines do not overlap, the midpoint between the upper and lower mandibular lines is used as the reference point.
Burst of hyoid bone
This is the first frame in which the hyoid bone begins its rapid anterosuperior movement.
Laryngeal vestibule closure (LVC)
This is the first frame in which the inferior surface of the epiglottis and the arytenoids make contact. When LVC occurs completely, the air space in the laryngeal vestibule becomes invisible. When LVC is incomplete, the frame in which the arytenoids and the inferior surface of the epiglottis are closest to each other is used.
Upper esophageal sphincter (UES) opening
This is the first frame in which the UES opens as the bolus or air passes through it.
UES closure
This is the first frame in which a single point or part of the UES segment closes behind the bolus tail.
Laryngeal vestibule closure offset
This is the earliest frame in which the air space of the laryngeal vestibule becomes visible.
Temporal Parameter Measurement
Seven temporal parameters, namely oral phase duration, pharyngeal delay time, pharyngeal response time, pharyngeal transit time, laryngeal vestibule closure reaction time, laryngeal vestibule closure duration, and upper esophageal sphincter opening duration, were measured from the labeled major events, similar to the ASPEKT method devised by Steele et al. [8] (Fig. 1). Because the fluoroscopy was recorded at 15 FPS, the temporal parameters could be calculated in milliseconds from the frame numbers of the major events.
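As an illustration of this conversion (the frame numbers below are hypothetical examples, not measurements from the dataset), each frame at 15 FPS spans 1000/15 ≈ 66.7 ms, so a parameter bounded by two major events is computed as
$$\text{parameter (ms)}=\left({f}_{\text{end}}-{f}_{\text{start}}\right)\times \frac{1000}{15}$$
For example, if LVC occurs at frame 52 and LVC offset at frame 61, the laryngeal vestibule closure duration would be (61 − 52) × 66.7 ≈ 600 ms.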
Development of automatic models
Figure 2 presents an overview of the proposed method for phase localization using deep learning.
Three-dimensional Convolutional Neural Network (3D-CNN)
A CNN is a type of artificial neural network based on convolution operations. The convolution operation effectively extracts and combines local image features and is applied to the input multiple times, each application constituting a convolutional layer. Two-dimensional (2D) CNNs are typically used to analyze 2D images in deep learning. To process video data, 3D-CNNs can be applied; these add a time dimension to the 2D input so that a video is processed as a temporal sequence of images. In this study, we adopted ResNet3D, a type of 3D-CNN, as explained in the following section [18].
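As a minimal sketch of how a 3D-CNN consumes video (assuming PyTorch and torchvision; the tensor sizes are illustrative and not the exact configuration used in this study), the model takes a clip of shape (batch, channels, frames, height, width):

```python
import torch
from torchvision.models.video import r3d_18  # ResNet3D-18 backbone

# Illustrative clip: batch of 2, RGB, 16 frames, 224x224 pixels
clip = torch.randn(2, 3, 16, 224, 224)

model = r3d_18(num_classes=2)  # binary output: inside / outside the phase of interest
model.eval()

with torch.no_grad():
    logits = model(clip)       # shape: (2, 2), one score per class for each clip
print(logits.shape)
```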
Models
The base architecture we used was ResNet3D-18 [18]. ResNet3D-18 has skip connections, in which the input bypasses the intermediate layers and is fed directly to the output. Skip connections facilitate the training of deep neural networks. The model consists of multiple residual blocks, each containing several convolutional layers and one skip connection. The residual block is depicted in Fig. 3. We adopted three architectural variants of ResNet3D-18 [18], which are described as follows:
DEFAULT This variant uses the default configuration of ResNet3D-18, changing only the number of input frames. In this study, we conducted experiments with the input length set to either seven or 13 frames. The label for the input was set to that of the center of the input window. For example, the label of a seven-frame input is set to one if the 4th frame of the input is in the phase of interest, and zero otherwise.
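A minimal sketch of this sliding-window labeling (assuming per-frame phase annotations are available as a 0/1 array; the function and variable names are hypothetical):

```python
import numpy as np

def make_windows(frame_labels, window=7):
    """Build (frame-index window, label) pairs: each window is labeled
    with the annotation of its center frame."""
    half = window // 2
    samples = []
    for center in range(half, len(frame_labels) - half):
        idx = np.arange(center - half, center + half + 1)  # `window` successive frames
        samples.append((idx, frame_labels[center]))         # label = center frame's label
    return samples

# Example: a 20-frame clip in which frames 8-14 belong to the phase of interest
labels = np.zeros(20, dtype=int)
labels[8:15] = 1
windows = make_windows(labels, window=7)
print(windows[0])  # frames 0-6 centered on frame 3 -> label 0
```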
FILL-16 In this variant, we sampled either seven or 13 successive frames and repeated them to obtain an input length of 16, which is the default input length for ResNet3D-18 [18]. For example, with a seven-frame input, the input is repeated twice and the first two frames are appended at the end to make 16 frames. As previously mentioned, the label for the input was set as that of the center of the original input window.
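This repetition amounts to cyclically indexing the sampled frames along the temporal axis, as in the following sketch (assuming clips are stored as (C, T, H, W) tensors):

```python
import torch

def fill_to_16(frames):
    """Repeat a short clip cyclically along the time axis until it is 16 frames long.

    frames: tensor of shape (C, T, H, W) with T = 7 or 13.
    """
    t = frames.shape[1]
    idx = torch.arange(16) % t       # e.g. T=7 -> 0..6, 0..6, 0, 1
    return frames[:, idx]            # shape (C, 16, H, W)

clip7 = torch.randn(3, 7, 224, 224)
print(fill_to_16(clip7).shape)       # torch.Size([3, 16, 224, 224])
```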
BIDIRECTIONAL We proposed a new architectural variant of ResNet3D. Inspired by bidirectional recurrent neural networks (RNNs) [19], we introduced a bidirectional structure into ResNet3D that captures the forward- and backward-stream features of a target frame, as follows (Fig. 4).
The bidirectional model combines the predictions from two separate ResNet3D models. Specifically, we used two models, denoted by \(\mathbf{M}1\) and \(\mathbf{M}2\), each of which takes \(L\) frames as the input. Suppose that we would like to predict the label of the center frame of \(2L-1\) loaded frames. The input for \(\mathbf{M}1\) is the first \(L\) successive frames, that is, from the first frame to the center of the loaded frames. The input for \(\mathbf{M}2\) is the last \(L\) successive frames, i.e., from the center of the loaded frames to the last frame. Thus, we expect \(\mathbf{M}1\) to learn forward temporal information and \(\mathbf{M}2\) to learn backward temporal information relative to the center frame. The outputs from \(\mathbf{M}1\) and \(\mathbf{M}2\) are concatenated, and the combined vector is passed to a fully connected layer to yield the final prediction.
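A minimal sketch of this bidirectional design (assuming PyTorch/torchvision; the feature dimension and the final head are illustrative, not the exact configuration reported here):

```python
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18

class BidirectionalR3D(nn.Module):
    """Two ResNet3D-18 backbones: one sees the forward window ending at the
    center frame, the other the backward window starting at it."""
    def __init__(self, num_classes=2):
        super().__init__()
        self.m1 = r3d_18()              # forward stream
        self.m2 = r3d_18()              # backward stream
        self.m1.fc = nn.Identity()      # expose 512-d features instead of class logits
        self.m2.fc = nn.Identity()
        self.head = nn.Linear(512 * 2, num_classes)

    def forward(self, clip):                          # clip: (B, C, 2L-1, H, W)
        center = clip.shape[2] // 2
        forward_clip = clip[:, :, :center + 1]        # first L frames
        backward_clip = clip[:, :, center:]           # last L frames
        feats = torch.cat([self.m1(forward_clip), self.m2(backward_clip)], dim=1)
        return self.head(feats)

model = BidirectionalR3D()
out = model(torch.randn(1, 3, 13, 112, 112))          # 2L-1 = 13 -> L = 7
print(out.shape)                                      # torch.Size([1, 2])
```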
Experimental Settings
Datasets The dataset comprised 547 VFSS videos, split into training, validation, and test sets of 444, 49, and 54 videos, respectively. The model input consisted of red, green, and blue (RGB) video frames resized to 224×224 pixels.
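A sketch of the frame preprocessing implied here (assuming PyTorch; the decoded frame size and scaling are illustrative assumptions, not values reported in this study):

```python
import numpy as np
import torch
import torch.nn.functional as F

def preprocess(frames_uint8):
    """frames_uint8: (T, H, W, 3) uint8 RGB frames decoded from a VFSS video."""
    x = torch.from_numpy(frames_uint8).float() / 255.0   # scale to [0, 1]
    x = x.permute(3, 0, 1, 2)                            # -> (C, T, H, W)
    x = F.interpolate(x, size=(224, 224), mode="bilinear", align_corners=False)
    return x                                             # (3, T, 224, 224)

dummy = np.random.randint(0, 256, size=(16, 480, 640, 3), dtype=np.uint8)
print(preprocess(dummy).shape)                           # torch.Size([3, 16, 224, 224])
```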
Baselines We compared our model with models from prior studies on VFSS. Lee et al. proposed a model based on VGG-16, a neural network with 16 weight layers that is widely used for image recognition [15, 20]. The I3D model described in Lee et al. is a network that has been widely applied to action recognition and classification tasks [16, 21, 22]. We trained the baseline models on our dataset and optimized their hyperparameters.
Transfer learning We adopted the following transfer learning technique. Deep learning models must be trained on a large amount of data; however, collecting large-scale medical data is difficult because of privacy issues and the small number of participants. Transfer learning retrains a model that has been pretrained on large datasets of generic images or videos. All the models considered in this study used transfer learning. For example, VGG-16 was pretrained on ImageNet, which contains 14,197,122 images across 200 categories [20, 21, 23]. I3D was pretrained on Kinetics-600 [21, 24], and the ResNet3D variants in this study were pretrained on Kinetics-700 [25]. The Kinetics datasets contain approximately 650,000 human-action video clips, each annotated with a single action class and lasting approximately 10 seconds; Kinetics-600 and Kinetics-700 contain videos of 600 and 700 action classes, respectively. The pretrained models yielded substantially higher performance than their non-pretrained counterparts.
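A minimal transfer learning sketch (assuming torchvision, which ships Kinetics-400 weights for ResNet3D-18; the study used a Kinetics-700 checkpoint, so this is illustrative rather than the exact setup): load the pretrained backbone, replace the action-classification head with a two-class head, and fine-tune on the VFSS clips.

```python
import torch.nn as nn
from torchvision.models.video import r3d_18, R3D_18_Weights

# Backbone pretrained on a Kinetics action-recognition dataset
model = r3d_18(weights=R3D_18_Weights.KINETICS400_V1)

# Replace the 400-class action head with a 2-class head
# (inside vs. outside the swallowing phase of interest), then fine-tune.
model.fc = nn.Linear(model.fc.in_features, 2)
```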
Performance Evaluation
The accuracy, F1 score, and average precision (AP) were calculated to comprehensively evaluate the temporal localization of the phases in the test dataset.
$$\text{accuracy}= \frac{\text{number of correct predictions}}{\text{number of predictions}}$$
$$\text{recall}= \frac{\text{number of positive predictions on the positive frames}\;(=TP)}{\text{number of positive frames}}$$
$$\text{precision}= \frac{\text{number of positive frames predicted positive}}{\text{number of positive predictions}}$$
$$F1\ \text{score}= \frac{2}{\frac{1}{\text{recall}}+\frac{1}{\text{precision}}}= \frac{TP}{TP+\frac{1}{2}\left(\text{number of incorrect predictions}\right)}$$
True positive (TP) denotes the number of positive predictions for the positive frames. AP is the area under the precision-recall curve; the higher the AP, the better, with an upper bound of 1. AP is a widely used metric for temporal action localization in computer vision and is particularly relevant to cases with class imbalance, that is, when only a small fraction of the given frames belong to the phase to be detected, as in our case.
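These metrics can be computed from per-frame predictions with scikit-learn, for example (a sketch with hypothetical arrays, not results from this study):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, average_precision_score

# Hypothetical per-frame ground truth (1 = frame inside the phase of interest)
y_true  = np.array([0, 0, 1, 1, 1, 1, 0, 0, 0, 0])
# Hypothetical model scores (e.g., probability of the positive class)
y_score = np.array([0.1, 0.3, 0.8, 0.9, 0.7, 0.4, 0.2, 0.1, 0.6, 0.2])
y_pred  = (y_score >= 0.5).astype(int)   # thresholded binary predictions

print("accuracy:", accuracy_score(y_true, y_pred))
print("F1 score:", f1_score(y_true, y_pred))
print("AP      :", average_precision_score(y_true, y_score))  # area under the PR curve
```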