This study applied an efficient large-scale audio tagging model [20], known for its outstanding performance in sound analysis, to predict dysphagia aspiration from postprandial voice recordings at two levels (normal and aspiration). The models demonstrated high predictive performance, with most achieving AUC values above 0.75 despite the diversity of participants' voices. In particular, the mn30_as model, which had the largest number of parameters among the trained models, achieved an AUC of 0.7879 in the combined model and 0.7787 in the male model, indicating good performance in predicting dysphagia aspiration. All other predictive performance measures for the combined and male models also exceeded 70%.
Various studies on dysphagia aspiration have been conducted using non-invasive methods. The 3-ounce water swallow test showed a sensitivity of 59–96.5% and a specificity of 15–59% when compared with FEES and VFSS. [24–26] The Gugging swallowing screen test had a sensitivity of 100% and a specificity of 50–69% in acute stroke patients. [27] Sensitivity and specificity for dysphagia based on language- and speech-related dysfunctions were reported as follows: aphasia (36% and 83%, respectively), dysarthria (56% and 100%, respectively), and a combination of variables (64% and 83%, respectively). [28] Dysphonia, dysarthria, gag reflex, cough, and voice changes have also been used as diagnostic performance measures. [29] Other screening tools, such as the functional oral intake scale (FOIS), the modified Mann assessment of swallowing ability test, and the volume-viscosity swallow test (V-VST), have likewise been developed and subjected to performance validation. [16, 26, 30–37] Although predictive performance varies across these techniques, all of them require expert intervention for accurate diagnosis and monitoring, limiting their applicability to everyday-life monitoring. Efforts to monitor dysphagia by observing voice changes are therefore ongoing. [11–14, 38, 39]
Most previous studies on voice analysis in patients with dysphagia have focused on frequency perturbation measures (RAP, jitter, PPQ, etc.), amplitude perturbation measures (shimmer, APQ, etc.), and noise analysis (NHR) to differentiate between high- and low-risk groups. [11–14, 38, 39] Vocal intensity (MVI) and vocal duration (MPT) have also been used as voice analysis indicators. [38] Moreover, some studies have analyzed the correlations between these measures and established clinical diagnostic indicators for dysphagia, such as the penetration-aspiration scale (PAS), the videofluoroscopic dysphagia scale (VDS), and the American speech-language-hearing association national outcome measurement system swallowing scale (ASHA-NOMS). [38] Some studies have used the Praat program to extract these acoustic parameters, either alone or combined with clinical indicators, and trained classifiers such as logistic regression, decision tree, random forest, support vector machine (SVM), Gaussian mixture model (GMM), and XGBoost. [12] Another study reported dysphagia prediction results using specific phonation or articulation features trained with SVM, random forest, and other methods. [39] However, these studies are limited in that they analyzed only specific numerical voice indicators and did not analyze the voice signal as a whole.
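To make this conventional workflow concrete, the following is a minimal sketch of perturbation-measure extraction, assuming the parselmouth Python wrapper for Praat; the file name is hypothetical, and the analysis settings follow Praat's defaults rather than any particular study's configuration.

```python
import parselmouth
from parselmouth.praat import call

# Load a voice recording (hypothetical file name).
snd = parselmouth.Sound("postprandial_voice.wav")

# Detect glottal pulses within a typical adult pitch range (75-500 Hz).
point_process = call(snd, "To PointProcess (periodic, cc)", 75, 500)

# Frequency perturbation measures (Praat default settings).
jitter_local = call(point_process, "Get jitter (local)", 0, 0, 0.0001, 0.02, 1.3)
rap = call(point_process, "Get jitter (rap)", 0, 0, 0.0001, 0.02, 1.3)
ppq5 = call(point_process, "Get jitter (ppq5)", 0, 0, 0.0001, 0.02, 1.3)

# Amplitude perturbation measures.
shimmer_local = call([snd, point_process], "Get shimmer (local)",
                     0, 0, 0.0001, 0.02, 1.3, 1.6)
apq11 = call([snd, point_process], "Get shimmer (apq11)",
             0, 0, 0.0001, 0.02, 1.3, 1.6)

# Harmonics-to-noise ratio as a noise measure (HNR; NHR is its MDVP counterpart).
harmonicity = call(snd, "To Harmonicity (cc)", 0.01, 75, 0.1, 1.0)
hnr = call(harmonicity, "Get mean", 0, 0)

features = [jitter_local, rap, ppq5, shimmer_local, apq11, hnr]
print(features)
```

A feature vector like this could then be passed to the classifiers named above, for example scikit-learn's LogisticRegression or RandomForestClassifier.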
Therefore, in this study, we trained a dysphagia prediction model on patients' entire voice recordings, represented as mel-spectrograms. Our model design focused on noise reduction, predictive performance, and a lightweight footprint for mobile integration. To reduce noise in the audio files, we implemented the preprocessing steps of an efficient large-scale audio tagging model, which improved prediction performance. [20, 21] Regarding the second consideration, we experimented with different models, including ResNet, which is known for its excellent performance in CNN-based image recognition. [40, 41] However, its accuracy was relatively low. We also found that training a model solely on the jitter, RAP, and shimmer parameters did not yield stable results. Considering recent advances in machine learning for sound analysis, we ultimately chose the current learning model. Regarding the third consideration, we focused on a lightweight model to achieve real-time dysphagia diagnosis, monitoring, and intervention in mobile or otherwise resource-constrained environments. We converted the audio data from stereo to mono, which improved efficiency by removing the need to process two channels simultaneously and enhanced voice recognition accuracy. [42] Additionally, we unified and compressed the files into mp3 format for real-time processing on mobile devices. [43, 44] Using the HDF5 data format provided faster loading, greater storage efficiency, and compatibility with various programming languages. [45, 46] Throughout the study, we prioritized a compact model that occupies less storage space and enables fast prediction of speech impairments. Employing MobileNetV3, a lightweight yet high-performance architecture, allows efficient execution on mobile devices. [47] We adapted the efficient large-scale audio tagging model [20, 21] as a reference, tailoring it to our specific data environment.
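As a concrete illustration of these preprocessing choices, the sketch below loads a recording as mono audio, computes a log-mel-spectrogram, and stores it in HDF5. It assumes librosa and h5py; the sample rate, FFT size, mel-bin count, and file names are illustrative placeholders, not the settings used in this study.

```python
import h5py
import librosa
import numpy as np

# Illustrative settings; not the study's actual configuration.
SAMPLE_RATE = 32000   # a common choice for audio-tagging models
N_FFT = 1024
HOP_LENGTH = 320
N_MELS = 128

def preprocess(path: str) -> np.ndarray:
    """Load an audio file (e.g., mp3) as mono and return a log-mel-spectrogram."""
    # mono=True averages the stereo channels into a single channel.
    y, sr = librosa.load(path, sr=SAMPLE_RATE, mono=True)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=N_FFT, hop_length=HOP_LENGTH, n_mels=N_MELS
    )
    # Log compression stabilizes the dynamic range for CNN input.
    return librosa.power_to_db(mel, ref=np.max).astype(np.float32)

# Store the preprocessed spectrograms in HDF5 for fast, compressed loading.
with h5py.File("dataset.h5", "w") as f:
    mel = preprocess("postprandial_voice.mp3")  # hypothetical file name
    f.create_dataset("mel/patient_001", data=mel, compression="gzip")
    f["mel/patient_001"].attrs["label"] = 1  # 1 = aspiration, 0 = normal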
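On the model side, a generic MobileNetV3 backbone can be assembled from torchvision as shown below. This is only a sketch of how such a backbone is adapted to single-channel mel-spectrogram input with a two-class (normal vs. aspiration) head; it is not the mn30_as model of [20, 21].

```python
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v3_small

# Generic MobileNetV3 backbone (not the pretrained mn30_as of [20, 21]).
model = mobilenet_v3_small(weights=None)

# Replace the stem so it accepts 1-channel mel-spectrograms instead of RGB.
model.features[0][0] = nn.Conv2d(1, 16, kernel_size=3, stride=2,
                                 padding=1, bias=False)

# Replace the final layer with a two-class head: normal vs. aspiration.
model.classifier[3] = nn.Linear(model.classifier[3].in_features, 2)

# Dummy forward pass: (batch, channel, mel bins, time frames).
logits = model(torch.randn(1, 1, 128, 1000))
print(logits.shape)  # torch.Size([1, 2])
```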
This study developed a model to predict dysphagia aspiration based on the postprandial voice. The expected benefits of this study are as follows. First, by detecting the occurrence of aspiration and providing clinicians with additional voice-derived parameters, it offers greater clinical utility than previous approaches. Second, the diagnosis time for both outpatients and inpatients is expected to be significantly reduced, while the additional diagnostic parameters enable a more accurate assessment of dysphagia. Third, this study is expected to lay the groundwork for diagnostic, treatment, and management systems that integrate future developments, such as a mobile application-based dysphagia meal-guide monitoring system.
Limitations
This study has several limitations. First, owing to the limited availability of voice data from individuals with dysphagia, we did not create a separate validation set; instead, we used a 9:1 training-to-testing split (10-fold cross-validation), as sketched below. Second, because few female aspiration subjects were recruited, the female model showed lower performance than the combined and male models. Third, voice data for healthy individuals and patients with dysphagia were collected in different environments and with different numbers of participants, and diet types were not standardized. Fourth, because the model is mel-spectrogram-based, it does not extract characteristic parameters comparable to conventional voice indicators. In future studies, we aim to develop a model with better predictive performance by recording a more diverse range of voices and diet types in patients with dysphagia and by comparing voice changes before and after meals.
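For reference, the 9:1 split used in each fold corresponds to standard 10-fold cross-validation, sketched below with scikit-learn. Stratification is an assumption here (the study reports only the split ratio), and X and y are hypothetical stand-ins for the spectrogram features and aspiration labels.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical feature matrix and binary labels (0 = normal, 1 = aspiration).
X = np.random.rand(100, 128)
y = np.random.randint(0, 2, size=100)

# Each of the 10 folds holds out 10% for testing (a 9:1 split);
# stratification preserves the class ratio in every fold.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # ...train a model on each fold and average AUC across the 10 folds...
```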