Music Emotion Recognition

Abstract. Today, a vast number of songs are available over the internet, but retrieving information from them can be complicated. This paper classifies songs by emotion using deep learning. We propose a strategy to recognize the emotion present in a song by classifying its spectrogram, which contains both time and frequency information. According to human psychology, neurons within a subpopulation of the brain do not react the same way to all emotions, so only specific neurons need to be triggered to identify an emotion. Different deep learning and machine learning algorithms are implemented to build the music emotion recognizer. The main objectives of this study are to study the features that are important for an audio file, to develop a music emotion classifier using a deep learning algorithm, and to validate the model. The datasets are split into training and testing sets, and the models are trained on the training set. The accuracy of the Artificial Neural Network (ANN) model is 79.7%, that of the K-Nearest Neighbor (KNN) model is 78.26%, and that of logistic regression for gender classification is 81%.


Introduction
Music is often referred to as a language of emotion. People tend to listen to different songs when in different moods. Music has been reported to evoke the full range of human emotions, from sad, nostalgic, and tense to happy, relaxed, calm, and joyous. Therefore, classifying music based on the emotions it expresses is increasingly essential for internet music service providers. Considering the scale of current digital libraries, an automatic classification tool is essential. Determining the emotional category of music is quite challenging: several issues need to be addressed, such as emotion labeling of music excerpts, feature extraction, and selection of the classification algorithm. To build a music emotion recognition system, a labelled emotional music database is needed.

Data sets
The data sets used for music emotion recognition are voice.csv, emotion features.csv, and data moods.csv, all of which are publicly available. Each data set contains the extracted features and a corresponding label that represents the emotion of the music. The emotion features.csv and data moods.csv data sets are for music mood classification, whereas voice.csv is for gender classification.

⋆ Supported by organization RVCE.

Audio Features
The audio features in the data moods data set are acousticness, danceability, energy, instrumentalness, liveness, loudness, speechiness, valence, and tempo. The audio features in the emotion features data set are chroma, mel spectrogram, RMSE, spectral centroid, spectral bandwidth, spectral rolloff, and zero crossing rate.
These features are used because they have the most influence in classifying tracks. There are four categories of emotion in the data set used, namely "energetic", "calm", "happy", and "sad". Each piece of music is labelled according to the emotion it expresses.

ANN Model
The first step in building a model is to pre-process the data. Pre-processing includes normalizing, encoding, splitting, and so on. The target variable and independent variables are defined, and the data is normalized using MinMaxScaler from the sklearn.preprocessing library. The target variable is label-encoded using the LabelEncoder() function from the same library. The data is then split into training and testing sets in the ratio 80:20, that is, 80% of the data is used as the training set and the remaining 20% as the testing set.

The labels of the data set are encoded using the LabelEncoder() function: "Calm" is encoded as 0, "Happy" as 1, "Relaxed" as 2, and "Sad" as 3. This is shown in figure 2. Figures 3 and 4 show the predicted labels for the test and training data sets respectively. The predicted classes are numbers because label encoding has been applied: 0 represents the mood calm, 1 happy, 2 relaxed, and 3 sad. Figure 5 gives the confusion matrix and accuracy of the model. The confusion matrix is used to evaluate the performance of the model; it compares the actual target values with those predicted by the model. The accuracy obtained is 79.7%.
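The preprocessing and training steps described above can be sketched as follows. This is a minimal illustration, not the paper's exact pipeline: scikit-learn's MLPClassifier stands in for the ANN, the random feature matrix is a placeholder for the real data set, and the layer sizes are assumptions.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

rng = np.random.default_rng(0)
X = rng.random((400, 7))                                   # placeholder for the extracted audio features
y = rng.choice(["Calm", "Happy", "Relaxed", "Sad"], 400)   # placeholder mood labels

# Normalize features to [0, 1] and integer-encode the labels
X = MinMaxScaler().fit_transform(X)
le = LabelEncoder()
y_enc = le.fit_transform(y)   # alphabetical: Calm -> 0, Happy -> 1, Relaxed -> 2, Sad -> 3

# 80:20 train/test split, as in the paper
X_tr, X_te, y_tr, y_te = train_test_split(X, y_enc, test_size=0.2, random_state=42)

# A small feed-forward network; the hidden-layer sizes are illustrative only
ann = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=42)
ann.fit(X_tr, y_tr)

pred = ann.predict(X_te)
print(confusion_matrix(y_te, pred))    # 4x4 matrix: actual vs. predicted moods
print("accuracy:", accuracy_score(y_te, pred))
```

Note that LabelEncoder assigns integers in alphabetical order of the class names, which matches the paper's Calm/Happy/Relaxed/Sad encoding.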

KNN Model
The first step in building a model is to pre-process the data. Pre-processing includes normalizing, encoding, splitting, treating null values, and so on. The target variable and independent variables are defined, and the data is normalized using MinMaxScaler from the sklearn.preprocessing library. The null values are dropped. The target variable is label-encoded using LabelEncoder() from the same library. The data is then split into training and testing sets in the ratio 70:30, that is, 70% of the data is used as the training set and the remaining 30% as the testing set.

Figure 7 gives the confusion matrix and accuracy of the model, along with the best number of neighbors among the given parameters. As there are four emotions, the confusion matrix is of size 4x4. The first list in figure 7 is the best number of neighbors found for the KNN model among the given parameters, the next array is the confusion matrix, and the last value is the test accuracy of the model. The accuracy obtained is 78.26%.

Figures 8 and 9 show the predicted labels for the test and training data sets respectively; the emotions are predicted by the trained model from the features extracted from each song. Figure 9 is the array of labels predicted for the training data set, which is given as input to the predict function. Figure 8 is the corresponding array of labels predicted for the test data set.

Figure 10 shows the Spotify URL of a song as provided by the Spotify app. The mood of the song "Blinding Lights" by The Weeknd was predicted using its Spotify URL, which was passed into a function named predict-mood. This function takes the ID of the song as an argument and applies the neural network model created earlier.
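The KNN steps above, including the search for the best number of neighbors, can be sketched as follows; again the random feature matrix is a placeholder for the real data, and the candidate neighbor counts are assumptions rather than the paper's actual grid:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, LabelEncoder
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

rng = np.random.default_rng(1)
X = rng.random((400, 7))                                   # placeholder feature matrix
y = rng.choice(["Calm", "Happy", "Relaxed", "Sad"], 400)   # placeholder mood labels

X = MinMaxScaler().fit_transform(X)
y_enc = LabelEncoder().fit_transform(y)

# 70:30 split, as in the paper
X_tr, X_te, y_tr, y_te = train_test_split(X, y_enc, test_size=0.3, random_state=42)

# Search a small grid for the best number of neighbors (the "first list" of Figure 7)
grid = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [3, 5, 7, 9, 11]}, cv=5)
grid.fit(X_tr, y_tr)
print("best n_neighbors:", grid.best_params_)

pred = grid.predict(X_te)              # labels predicted for the test set
print(confusion_matrix(y_te, pred))    # 4x4: one row/column per emotion
print("test accuracy:", accuracy_score(y_te, pred))
```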
Fig. 11. Prediction of the mood of the song as Energetic.
Figure 11 shows that the result identifies "Blinding Lights" by The Weeknd as having an energetic mood, which is evidently quite accurate.
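The shape of the predict-mood step can be sketched as below. Everything here is illustrative: in the paper the song's audio features come from the Spotify API (e.g. the spotipy library exposes them via an audio-features call), while this sketch uses an offline stub with invented values, and the rule-based "model" is a stand-in for the trained network.

```python
# Moods in alphabetical order, as an integer-encoded classifier would index them
MOODS = ["Calm", "Energetic", "Happy", "Sad"]

def fetch_audio_features(track_id):
    # Stub for the Spotify API call; real code would look up the track's
    # danceability, energy, valence, tempo, etc. from its ID. Values invented.
    return {"danceability": 0.51, "energy": 0.73, "valence": 0.33, "tempo": 171.0}

def predict_mood(track_id, model):
    feats = fetch_audio_features(track_id)
    # Assemble a feature vector (tempo roughly scaled to [0, 1]) and classify it
    x = [feats["danceability"], feats["energy"], feats["valence"], feats["tempo"] / 250.0]
    return MOODS[model(x)]

# Stand-in "model": calls high-energy tracks Energetic (illustrative only)
toy_model = lambda x: 1 if x[1] > 0.6 else 0

print(predict_mood("placeholder-track-id", toy_model))  # -> Energetic
```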

Logistic Regression
The results of gender recognition are given below. Figure 12 is a sample of the data set used for the gender classification model, showing its first 5 rows. It contains statistical features such as IQR, mean, median, quartiles, skewness, centroid, and mean frequency, and one column contains the label representing the gender recognized in the song.
The labels are encoded using a label encoder: males are encoded as 1 and females as 0. Figure 13 is the classification report of the gender classification model. The classification report gives performance measures such as F1-score, precision, recall, support, and accuracy. Precision quantifies the number of correct positive predictions made; it is calculated as the ratio of correctly predicted positive examples to the total number of examples predicted as positive. Recall quantifies the number of correct positive predictions made out of all the positive predictions that could have been made. Unlike precision, which only considers the correct positive predictions among all positive predictions, recall provides an indication of missed positive predictions and thus some notion of the coverage of the positive class.
The F1 score is the harmonic mean of precision and recall, so it takes both false positives and false negatives into account; a good F1 score means low false positives and low false negatives. All the metrics are above 75%. There is no imbalance in the data, and the model is a good one, as there is not much difference in the metric values between classes 0 and 1.
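The definitions above can be checked with a small worked example; the confusion-matrix counts here are invented purely for illustration:

```python
# Worked example of precision, recall, and F1 from a 2x2 confusion matrix
# (counts invented for illustration; label 1 = male, 0 = female as in the paper)
tp, fp, fn, tn = 80, 20, 10, 90          # counts for the positive class (label 1)

precision = tp / (tp + fp)               # correct positives / all predicted positives
recall    = tp / (tp + fn)               # correct positives / all actual positives
f1        = 2 * precision * recall / (precision + recall)  # harmonic mean

print(precision)  # 0.8
print(recall)     # 80/90, about 0.889
print(f1)         # 16/19, about 0.842
```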
Support is the number of instances in the data,that is from figure 13 number of 0 labels is 1584 and number of 1 labels is 1584.That is there are 1584 females and males in the data.
The accuracy of the model is 81%. Figure 14 gives the confusion matrix of the model; it is a 2x2 matrix, as there are only two labels, and shows actual versus predicted labels. Figures 15 and 16 show the predicted labels for the test and training data sets respectively: Figure 15 is the array of labels predicted for the test set and Figure 16 the array predicted for the training set, each obtained by giving the corresponding data as input to the predict function. Table 1 summarizes all the models with their accuracies in percentage.
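The gender-classification pipeline can be sketched as follows; the synthetic data below merely stands in for voice.csv (its columns and the decision rule generating the labels are invented so the example is runnable):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

rng = np.random.default_rng(2)
# Placeholder for voice.csv: a few acoustic statistics per voice sample,
# with label 1 = male and 0 = female as in the paper
X = rng.random((600, 5))
y = (X[:, 0] + 0.2 * rng.standard_normal(600) > 0.5).astype(int)  # synthetic, learnable labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
pred = clf.predict(X_te)

print(classification_report(y_te, pred))  # precision, recall, f1-score, support per class
print(confusion_matrix(y_te, pred))       # 2x2: actual vs. predicted gender
```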

Conclusion
This paper implemented a modified architecture that produces acceptable results with fewer parameters than a traditional neural network, achieved by disregarding unimportant neurons. The model successfully accomplished its goal and provided satisfactory results when tested on an integrated model with all four emotions, namely Happy, Angry, Sad, and Relaxed. The model was developed based on the arousal and valence values of a song. The accuracy of the ANN is 79.7%, that of the KNN model is 78.26%, and that of logistic regression is 81%.