Speech emotion recognition is a key branch of affective computing, and it is now common to screen for emotional disorders through speech emotion recognition. Various models for emotion recognition, such as LSTM, GCN, and CNN, show excellent performance. However, because of limited model robustness, the recognition results of these models can deviate substantially under perturbation. In this article, we therefore use black-box adversarial-example attacks to probe model robustness. Under three different black-box attacks, the accuracy of the CNN-MAA model decreased by 69.38% in the best attack scenario, while the word error rate (WER) of the audio changed by only 6.24%, indicating that the model's robustness does not hold up under our black-box attack methods. After adversarial training, model accuracy decreased by only 13.48%, which demonstrates the effectiveness of adversarial training against adversarial-example attacks.
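To illustrate the kind of query-only black-box attack discussed above (not the paper's actual attack), the following is a minimal sketch: the attacker has no gradients, only label queries, and searches for a small perturbation within an L-infinity ball that flips the model's prediction. The `toy_classifier` and all parameter values are hypothetical stand-ins for a real speech emotion model and acoustic features.

```python
import random

def toy_classifier(x):
    # Hypothetical stand-in for a speech emotion model:
    # labels a feature vector by the sign of its mean.
    return 1 if sum(x) / len(x) > 0 else 0

def black_box_attack(x, model, eps=0.05, queries=200, seed=0):
    """Query-only random-search attack: sample perturbations of x within
    an L-infinity ball of radius eps until the model's label flips.
    Returns (adversarial_example, success_flag)."""
    rng = random.Random(seed)
    orig_label = model(x)           # one query to get the clean label
    adv = list(x)
    for _ in range(queries):
        # Candidate stays within eps of the original in every coordinate,
        # mirroring the "small perturbation, low WER change" constraint.
        cand = [xi + rng.uniform(-eps, eps) for xi in x]
        if model(cand) != orig_label:
            return cand, True       # label flipped: attack succeeded
        adv = cand
    return adv, False               # query budget exhausted

# Toy "acoustic feature" vector with a weakly positive mean.
x = [0.01, -0.02, 0.015, 0.005]
adv, ok = black_box_attack(x, toy_classifier)
```

Adversarial training, as evaluated in the article, would then fold examples like `adv` (with the original labels) back into the training set so the model learns to resist such perturbations.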