This study was approved by the Institutional Review Board of Yeungnam University Hospital, which waived the requirement for informed consent because of the retrospective nature of the study and the use of anonymized clinical data. All procedures were carried out in accordance with the relevant guidelines and regulations. We included patients who visited the outpatient clinic of the rehabilitation department, were admitted to the rehabilitation department of one of two university hospitals (Ulsan University Hospital and Yeungnam University Hospital) because of dysphagia, or were diagnosed using VFSS between January 2009 and April 2020. The steps of the modeling process applied in this study are shown in Fig. 1.
Data collection
The VFSS data of 190 participants with dysphagia were collected. The exclusion criteria were as follows: (1) age less than 20 years; (2) previous tracheostomy; (3) facial or cranial anomalies; and (4) a metal plate in the cervical spine or facial bone that could produce an artifact.
Analysis of VFSS
When the VFSS was performed, the patients were instructed to sit upright under a videofluoroscopy machine with the head in a neutral position. The frame of the videofluoroscopic image was bounded by the incisors anteriorly, the cervical vertebrae posteriorly, the nasal border of the soft palate superiorly, and the cervical esophagus inferiorly [16, 17]. The fluoroscopic images of the swallows were digitally recorded and stored at 30 frames/s [16, 17].
Each VFSS was performed using a bolus of "thin" fluid (1–50 cP). Each patient received a 5-ml bolus delivered using a 10-ml syringe [16, 17].
In the analysis of the VFSS, penetration was defined as passage of the contrast material above, but not below, the true vocal cords [18]; aspiration was defined as passage of the contrast material below the true vocal cords [18]. Based on these criteria, two rehabilitation medicine specialists, each with more than 10 years of clinical experience in dysphagia, reviewed the dynamic fluoroscopic images for the presence or absence of penetration or aspiration. Based on the VFSS, patients were classified into normal (no penetration or aspiration), penetration, and aspiration groups.
VFSS image selection
To analyze the VFSS by deep learning, we selected five consecutive frame images (at 0.33-s intervals) around the point at which the hyoid bone reached its peak (the highest position of the hyoid bone; high-peak images), and another five consecutive frame images around the point at which the hyoid bone had completely descended from the peak (the lowest position of the hyoid bone; low-peak images) (Fig. 1). Thus, 10 frame images (five high-peak and five low-peak images) were selected from each swallowing process for the application of deep learning to the VFSS video of a patient with dysphagia (Fig. 1).
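The frame-selection step above can be sketched as follows. This is a minimal illustration, assuming the frame indices of the hyoid-bone peak and of its lowest descent have already been identified; the function and variable names are illustrative and not from the study.

```python
# Sketch: selecting the 10 analysis frames from one swallow, given the
# reference frame at the hyoid-bone peak and at its lowest descent.
# `step` controls the spacing between selected frames (illustrative).

def select_frames(peak_idx, low_idx, step=1):
    """Return five frame indices centred on each reference frame."""
    offsets = range(-2, 3)  # two frames before, the reference frame, two after
    high_peak = [peak_idx + o * step for o in offsets]
    low_peak = [low_idx + o * step for o in offsets]
    return high_peak, low_peak

high, low = select_frames(peak_idx=120, low_idx=150)
print(high)  # [118, 119, 120, 121, 122]
print(low)   # [148, 149, 150, 151, 152]
```

Each selected frame index would then be used to extract the corresponding image from the recorded video.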
Deep learning analysis
We applied a convolutional neural network (CNN) for deep learning using the Python programming language. TensorFlow 2.4, the Keras framework, and scikit-learn 0.24.1 were used to train the CNN model. To achieve better learning outcomes, we employed a pre-trained CNN model with fine-tuning. The details and performance of the model are described in Table 2. A CNN consists of one or more convolutional layers, often with a subsampling layer, followed by one or more fully connected layers as in a standard neural network [19]. The deep learning models were trained with VFSS images as inputs to classify patients with dysphagia into normal (no penetration or aspiration), penetration, or aspiration groups. Of the study population (190 patients), 70% (n = 133), 20.53% (n = 39), and 9.47% (n = 18) were assigned to the training, validation, and test sets, respectively. Correspondingly, of the 950 high-peak and 950 low-peak images, 70% (665 images), 20.53% (195 images), and 9.47% (90 images) were used for training, validation, and testing, respectively.
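A fine-tuning setup of the kind described can be sketched in Keras as follows. The backbone (ResNet50), input size, optimizer, and learning rate here are illustrative assumptions only; the actual architecture and hyperparameters are those given in Table 2.

```python
# Sketch: a pre-trained CNN with fine-tuning for 3-class classification
# (normal / penetration / aspiration). Backbone and hyperparameters are
# assumptions, not the study's reported configuration.
import tensorflow as tf
from tensorflow.keras import layers, models

# weights="imagenet" would load pre-trained weights, as the study describes;
# None is used here only so the sketch runs without downloading weights.
base = tf.keras.applications.ResNet50(
    weights=None, include_top=False, input_shape=(224, 224, 3))
base.trainable = True  # fine-tune the backbone rather than freezing it

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(3, activation="softmax"),  # normal / penetration / aspiration
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

A small learning rate is typical when fine-tuning pre-trained weights, to avoid destroying the transferred features early in training.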
To obtain the classification model according to the VFSS findings (normal, penetration, and aspiration), classification was first conducted separately on the high-peak and low-peak images. We applied the following criteria to each set of five images: 1) normal: ≥ 4 normal images; 2) penetration: < 4 normal images and no aspiration image; and 3) aspiration: < 4 normal images and ≥ 1 aspiration image. The two classifications from the high-peak and low-peak images were then integrated into a final classification according to the following criteria: 1) normal: normal in both the high-peak and low-peak classifications; 2) penetration: ≤ 1 normal (of the two classification results) and no aspiration; and 3) aspiration: ≤ 1 normal and ≥ 1 aspiration (Table 3).
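The two-stage decision rule above can be written directly as code. This is a sketch of the stated criteria, assuming each frame-level prediction is one of the three labels; the function names are illustrative.

```python
# Sketch of the rule-based classification: first per image set (five frames),
# then integration of the high-peak and low-peak results.

def classify_five(preds):
    """Classify one set of five frame predictions (high-peak or low-peak)."""
    n_normal = preds.count("normal")
    if n_normal >= 4:
        return "normal"          # >= 4 normal images
    if "aspiration" in preds:
        return "aspiration"      # < 4 normal and >= 1 aspiration image
    return "penetration"         # < 4 normal and no aspiration image

def integrate(high_label, low_label):
    """Combine the high-peak and low-peak classifications."""
    labels = [high_label, low_label]
    if labels.count("normal") == 2:
        return "normal"          # normal in both classifications
    if "aspiration" in labels:
        return "aspiration"      # <= 1 normal and >= 1 aspiration
    return "penetration"         # <= 1 normal and no aspiration

high = classify_five(["normal", "normal", "normal", "normal", "penetration"])
low = classify_five(["normal", "penetration", "aspiration", "normal", "normal"])
print(high, low, integrate(high, low))  # normal aspiration aspiration
```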
Statistical analysis
Statistical analyses were performed using Python 3.7.9 and scikit-learn 0.24.1. Receiver operating characteristic (ROC) curve analysis was performed, and the area under the curve (AUC) was calculated. The confidence interval for the average AUC was calculated with the bias-corrected and accelerated bootstrap method using R 4.0.5 and the multiROC 1.1.1 package [20].
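A multiclass ROC/AUC computation in scikit-learn can be sketched as below, using a one-vs-rest scheme over the three classes. The data here are random placeholders standing in for the model's predicted probabilities on the test set.

```python
# Sketch: macro-averaged one-vs-rest AUC for a 3-class problem with
# scikit-learn. Labels and scores are random placeholders.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 3, size=90)         # 0=normal, 1=penetration, 2=aspiration
scores = rng.random((90, 3))
scores /= scores.sum(axis=1, keepdims=True)  # rows sum to 1, like softmax output

auc = roc_auc_score(y_true, scores, multi_class="ovr", average="macro")
print(round(auc, 3))
```

The bias-corrected and accelerated confidence interval reported in the study was computed separately in R with the multiROC package.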