Speech Emotion Recognition (SER) plays a vital role in human-computer interaction as an important branch of affective computing. Due to inconsistencies in the data and challenging signal extraction, in this paper, we propose a novel emotion recognition method based on the combination of Adaptive Neuro-Fuzzy Inference System (ANFIS) and Particle Swarm Optimization (PSO) with Word to Vector (Word2Vec) models. To begin, the inputs have been pre-processed, which comprise audio and text data. Second, the features were extracted using the Word2vec behind spectral and prosodic approaches. Finally, the features are selected using the Sequential Backward Floating Selection (SBFS) approach. In the end, the ANFIS-PSO model has been used to recognize speech emotion. A performance evaluation of the proposed algorithm is carried out on Sharif Emotional Speech Database (ShEMO). The experimental results show that the proposed algorithm has advantages in accuracy, reaching 0.873 and 0.752 in males and females, respectively, in comparison with the CNNs and SVM, MLP, RF models.