Purpose
To develop and validate a multimodal deep learning approach that integrates audio and text features to improve the detection of high-risk calls on crisis intervention hotlines.
Methods
The training data comprised 14,181 valid samples collected in 2023 from Hangzhou’s psychological crisis intervention hotline. To evaluate real-time response performance, audio and text segments of several lengths were extracted, and a separate group of models was trained for each length. Audio features were extracted with the Librosa library: 12-dimensional Chromagram features, 128-dimensional Mel spectrograms, and 40-dimensional Mel Frequency Cepstral Coefficients (MFCC). These features were fed into a pre-trained ResNet50 architecture to obtain a high-level audio representation. Text features were extracted with the Chinese RoBERTa-wwm-ext-large model. Audio and text features were fused through Long Short-Term Memory (LSTM) networks, and the model was trained with a cross-entropy loss function. After ten-fold cross-validation, model performance was assessed on an independent test set of 14,354 valid samples from 2022.
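The audio feature step described above can be sketched with Librosa as follows. This is a minimal illustration rather than the study's actual pipeline: the helper name `extract_audio_features`, the 16 kHz sampling rate, the log scaling of the Mel spectrogram, and the choice to stack the three feature sets into a single 180-row matrix are all assumptions, since only the feature dimensions are stated here.

```python
# Feature dimensions as stated in the Methods paragraph.
FEATURE_DIMS = {"chroma": 12, "mel": 128, "mfcc": 40}


def extract_audio_features(path, seconds=10.0, sr=16000):
    """Return a (180, n_frames) array of stacked chroma, Mel, and MFCC features.

    Hypothetical helper: the sampling rate, log scaling, and stacking order
    are assumptions, not details taken from the study.
    """
    # Imported here so the module still loads where librosa is not installed.
    import numpy as np
    import librosa

    # Load only the first `seconds` of the recording, resampled to `sr`.
    y, sr = librosa.load(path, sr=sr, duration=seconds)

    chroma = librosa.feature.chroma_stft(
        y=y, sr=sr, n_chroma=FEATURE_DIMS["chroma"]
    )
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=FEATURE_DIMS["mel"])
    mel_db = librosa.power_to_db(mel, ref=np.max)  # log-scaled Mel spectrogram
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=FEATURE_DIMS["mfcc"])

    # All three calls use librosa's default frame settings, so the number of
    # frames (columns) matches and the matrices can be stacked row-wise.
    return np.vstack([chroma, mel_db, mfcc])
```

For a 10-second segment, `extract_audio_features("call.wav", seconds=10.0)` would return an array with 180 rows, where `call.wav` is a placeholder path. The result would still need to be reshaped to whatever input format the ResNet50 expects, which is not specified here.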
Results
On the independent test set from 2022, with a 10-second audio input combined with 41 characters of text, the model achieved an overall accuracy of 0.75, a recall of 0.73 for high-risk calls, and an Area Under the Receiver Operating Characteristic Curve (AUC) of 0.79; with a 30-second audio input and 123 characters, the overall accuracy was 0.79, the recall for high-risk calls was 0.80, and the AUC was 0.87; with a 60-second audio input and 246 characters, the overall accuracy was 0.74, the recall for high-risk calls was 0.87, and the AUC was 0.85.
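For readers less familiar with the three reported metrics, the following minimal sketch shows how overall accuracy, recall for the high-risk class, and AUC can be computed from labels and predicted scores. The toy labels, the scores, and the 0.5 decision threshold are illustrative assumptions and are not data from this study; the AUC is computed with the standard pairwise-ranking formulation.

```python
def accuracy(y_true, y_pred):
    """Fraction of calls whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of truly high-risk calls that were flagged as high-risk."""
    flags = [p == positive for t, p in zip(y_true, y_pred) if t == positive]
    return sum(flags) / len(flags)


def roc_auc(y_true, scores, positive=1):
    """AUC as the probability that a random high-risk call is scored above
    a random non-high-risk call (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example (illustrative only): 1 = high-risk, 0 = not high-risk.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

print(accuracy(y_true, y_pred))  # 0.75
print(recall(y_true, y_pred))    # 0.75
print(roc_auc(y_true, scores))   # 0.9375
```

Accuracy and recall depend on the chosen threshold, whereas AUC summarizes ranking quality across all thresholds, so the two can move in different directions across settings.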
Conclusion
This study explores the integration of voice-text multimodal deep learning into crisis intervention hotlines. The empirical results suggest promising potential for improving the efficiency of high-risk call identification.