The important steps in the LyBERT training, evaluation, and prediction process are described below:
5.1 BERT tokenization
This step converts the lyric sentences into BERT's token format, with sequences truncated to a maximum of 128 tokens. Shown below is the first training observation, tokenized and then converted into the input features BERT can interpret (tokens, input IDs, input masks, and segment IDs):
Sample Lyrics: "within heart every man symbol deep truly all descending power unfortunately still asleep may put hands eyes gleam never ending much turn inside conceals understanding five pointed grey star carven sign aryan race five pointed grey star carven forehand evil face"
------------------------------ ------------------------------
Tokens : ['within', 'heart', 'every', 'man', 'symbol', 'deep', 'truly', 'all', '##des', '##cend', '##ing', 'power', 'unfortunately', 'still', 'asleep', 'may', 'put', 'hands', 'eyes', 'gleam', 'never', '##ending', 'much', 'turn', 'inside', 'conceal', '##s', 'understanding', 'five', 'pointed', 'grey', 'star', 'car', '##ven', 'sign', 'aryan', 'race', 'five', 'pointed', 'grey', 'star', 'car', '##ven', 'fore', '##hand', 'evil', 'face']
------------------------------
Input IDs : [101, 2306, 2540, 2296, 2158, 6454, 2784, 5621, 2035, 6155, 23865, 2075, 2373, 6854, 2145, 6680, 2089, 2404, 2398, 2159, 24693, 2196, 18537, 2172, 2735, 2503, 19819, 2015, 4824, 2274, 4197, 4462, 2732, 2482, 8159, 3696, 26030, 2679, 2274, 4197, 4462, 2732, 2482, 8159, 18921, 11774, 4763, 2227, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
------------------------------
Input Masks : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
------------------------------
Segment IDs : [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
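The conversion from WordPiece tokens to the fixed-length features shown above can be sketched as follows. This is a simplified illustration, not the original implementation: the helper name `to_bert_features` is hypothetical, and 101, 102, and 0 are BERT's standard [CLS], [SEP], and [PAD] IDs.

```python
def to_bert_features(token_ids, max_len=128, cls_id=101, sep_id=102, pad_id=0):
    """Build input IDs, input mask, and segment IDs of length max_len."""
    # Reserve two slots for the special [CLS] and [SEP] tokens.
    ids = [cls_id] + list(token_ids)[: max_len - 2] + [sep_id]
    mask = [1] * len(ids)       # 1 = real token, 0 = padding
    segments = [0] * len(ids)   # single-sentence input -> all segment 0
    pad = max_len - len(ids)
    return ids + [pad_id] * pad, mask + [0] * pad, segments + [0] * pad
```

This reproduces the pattern in the sample above: the input IDs begin with 101 and end with 102 before the zero padding, and the input mask is 1 exactly on the non-padding positions.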
5.2 LyBERT hyperparameter configuration
The LyBERT experiments were carried out under several configurations. This work used Google's BERT-Base, Uncased model with 12 layers, and the entire implementation ran on Google Colab with a TPU runtime. Following the specifications given by the Google research community, the hyperparameters and TPU configuration were initialized with the following values. The batch size was set to 32, 64, and 128; the other settings are as follows:
Learning rate = 2e-5, 3e-5, and 5e-5
Number of training epochs = 3.0
Warm-up proportion = 0.1
LyBERT model configuration:
Save checkpoints steps = 300
Save summary steps = 100
A dropout layer is added to prevent overfitting.
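Collected in one place, the settings above might be expressed as a plain configuration dictionary; the name `LYBERT_CONFIG` is illustrative and not taken from the original code.

```python
# Illustrative container for the hyperparameters listed above.
LYBERT_CONFIG = {
    "bert_model": "BERT-Base, Uncased (12-layer)",
    "max_seq_length": 128,
    "batch_sizes": [32, 64, 128],          # grid of batch sizes tried
    "learning_rates": [2e-5, 3e-5, 5e-5],  # grid of learning rates tried
    "num_train_epochs": 3.0,
    "warmup_proportion": 0.1,
    "save_checkpoints_steps": 300,
    "save_summary_steps": 100,
}
```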
5.3 LyBERT results on the validation dataset
Table 2 shows the average accuracy of running the LyBERT model ten times using different learning rates and batch sizes, together with the precision, recall, and F1 score on the validation dataset. The average accuracy is approximately 92%, and the precision, recall, and F1 score are all 0.96. The validation data categorized by emotion class is illustrated in figure 12.
Table 2
BERT_uncased_L-12_H-768_A-12/1 model implementation

| Overall Accuracy | Precision | Recall | F1-score |
|------------------|-----------|--------|----------|
| 91.69            | 0.96      | 0.96   | 0.96     |
Evaluation on the validation dataset produced the following outcomes: false negatives: 46, false positives: 41, true negatives: 558, and true positives: 1112.
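As a sanity check, the reported precision, recall, and F1 score can be recomputed directly from these raw counts; a minimal sketch (the function name `prf` is ours):

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 from raw confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts from the validation run above.
p, r, f1 = prf(tp=1112, fp=41, fn=46)
```

All three round to 0.96, matching Table 2.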
5.4 Multi-class classification performance metrics
The multi-class classification results of BERT are shown in Table 3.
Table 3
Multi-class classification results

| Classes      | precision | recall | f1-score | support |
|--------------|-----------|--------|----------|---------|
| 0            | 0.82      | 0.83   | 0.82     | 1620    |
| 1            | 0.76      | 0.73   | 0.75     | 665     |
| 2            | 0.77      | 0.77   | 0.77     | 1583    |
| 3            | 0.72      | 0.67   | 0.69     | 525     |
| accuracy     | 0.78      | 0.78   | 0.78     | 4393    |
| macro avg    | 0.76      | 0.76   | 0.76     | 4393    |
| weighted avg | 0.78      | 0.78   | 0.78     | 4393    |
Table 3 shows the detailed multi-class classification results, averaged over the ten experiments conducted with batch size 32 and learning rate 5e-5. Experiments were also conducted under other combinations of learning rate and batch size, but the metrics for the classes happy, angry, and sad were lower or unchanged. With batch size 32 and learning rate 5e-5, the relaxed class reached an improved accuracy of 72%; under the other batch sizes and learning rates, its accuracy and precision were both 69%. Classes 0, 1, 2, and 3 correspond to the Happy, Angry, Sad, and Relaxed emotions. The precision, recall, f1-score, and support are 0.82, 0.83, 0.82, and 1620 for Happy; 0.76, 0.73, 0.75, and 665 for Angry; 0.77, 0.77, 0.77, and 1583 for Sad; and 0.72, 0.67, 0.69, and 525 for Relaxed.
The confusion matrix in figure 7 shows the actual and predicted values of the four emotion classes used in this work. The diagonal cells represent the true positive (T.P.) values of the corresponding classes: Happy has a T.P. of 1300, Angry 480, Sad 1200, and Relaxed 360. For Happy, the true negatives (T.N.) are calculated as 2433 (480+63+27+100+360+1200+79+96+28), the false positives (F.P.) as 286 (60+180+46), and the false negatives (F.N.) as 283 (57+190+36). The Happy and Sad labels appear more often in the test dataset than Relaxed and Angry; owing to this, LyBERT's performance on the Angry and Relaxed emotions is lower, at 76% and 67%, respectively.
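The per-class T.P., F.P., F.N., and T.N. counts follow mechanically from the confusion matrix. A minimal sketch of the bookkeeping (the full 4×4 matrix lives in figure 7 and is not reproduced here, so the demonstration below uses a small made-up matrix):

```python
import numpy as np

def per_class_counts(cm, k):
    """TP/FP/FN/TN for class k, given a confusion matrix (rows = actual, cols = predicted)."""
    tp = cm[k, k]                    # diagonal cell
    fp = cm[:, k].sum() - tp         # predicted as k, actually another class
    fn = cm[k, :].sum() - tp         # actually k, predicted as another class
    tn = cm.sum() - tp - fp - fn     # everything outside row k and column k
    return int(tp), int(fp), int(fn), int(tn)

# Toy 2-class example (not the figure 7 matrix).
cm = np.array([[5, 1],
               [2, 7]])
counts = per_class_counts(cm, 0)
```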
When the classes are mildly or severely imbalanced, the precision-recall curve can be used to visualize and assess the quality of the predictions made on the data. Precision measures the relevance of the returned results, whereas recall indicates how many of the relevant results are returned. The precision-recall curve depicts the trade-off between precision and recall for various decision thresholds. A large area under the curve indicates both high precision (a low false-positive rate) and high recall (a low false-negative rate): the classifier produces accurate results (high precision) and returns most of the positive results (high recall). A model with high recall but low precision returns a large number of results, but many of its predicted labels are incorrect with respect to the training labels. A model with high precision but low recall returns fewer results, but the majority of its predicted labels are correct.
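The threshold sweep behind such a curve can be sketched in a few lines. This is a one-vs-rest illustration of the idea, not the plotting code used for the figures:

```python
def pr_points(scores, labels, thresholds):
    """(precision, recall) pairs as the decision threshold varies (one-vs-rest)."""
    points = []
    for t in thresholds:
        pred = [s >= t for s in scores]
        tp = sum(p and y for p, y in zip(pred, labels))
        fp = sum(p and not y for p, y in zip(pred, labels))
        fn = sum((not p) and y for p, y in zip(pred, labels))
        precision = tp / (tp + fp) if tp + fp else 1.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        points.append((precision, recall))
    return points

# Toy scores for one class vs. the rest.
pts = pr_points([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 1], thresholds=[0.5])
```

Raising the threshold generally trades recall for precision, which is exactly the trade-off the curve traces out.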
The plot shown in figure 8 illustrates the quality of predictions of the LyBERT model. The class with the largest area is class 0 (happy lyrics), because its precision-recall scores are the highest. The second-largest area belongs to class 2 (sad lyrics), which has higher precision-recall scores than classes 1 and 3. The third-largest area belongs to class 1 (angry lyrics). Class 3 has the smallest area, meaning it has lower precision and recall scores than the other three classes.
Figure 9 shows a basic ROC (receiver operating characteristic) curve plot for all four emotion classes. According to that plot, the true positive vs. false positive curve of class 0 encloses the largest area; in decreasing order of area under the curve, the remaining classes are class 1, class 2, and class 3. The areas are measured and shown in the next plot, in figure 10, where the black diagonal dashed line in the middle is the chance line along which TPR (true positive rate) = FPR (false positive rate).
Any curve closer to the top-left corner indicates good prediction by a multi-class classification model; a curve close to the TPR = FPR line in the middle is not a good performance indicator. For example, the yellow line is a better predictor than the red line because it sits slightly above it. The blue line, however, is not better than the yellow line even though it is higher, because it lies further to the right. It is therefore also necessary to check the AUC (area under the curve) of the ROC curve: this is the total area under the ROC line, so there is no need to judge which line is better by eye. Checking the area values against the graph, the light blue (cyan, class 0) line is the best predictor, with an area of 0.86. The AUC and ROC curves suggest that the proposed lyrics emotion classification approach is a promising research direction that can be further improved through transfer learning and sentence embeddings. Figure 11 shows a sample of randomly selected lyrics for emotion detection.
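The AUC values quoted above can be computed without integrating the curve explicitly, via the rank-sum (Mann-Whitney) formulation: the AUC equals the probability that a randomly chosen positive scores higher than a randomly chosen negative. A minimal one-vs-rest sketch, not the library call used for the figures:

```python
def auc_score(scores, labels):
    """ROC AUC via the rank-sum (Mann-Whitney) formulation, one-vs-rest."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    # Count positive-negative pairs the model orders correctly; ties count half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: 3 of 4 positive-negative pairs are ordered correctly.
auc = auc_score([0.9, 0.8, 0.4, 0.2], [1, 0, 1, 0])
```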
The prediction results of the sample lyrics in figure 11 are shown in figure 12.
The percentage of predicted emotion labels for each class is shown in the pie diagram below (Figure 13):
The bar diagram in figure 14 below shows the distribution of predicted emotion labels for each emotion class. The highest count is predicted for the class happy, the second highest for sad, the third highest for angry, and the lowest for relaxed. The subset of the dataset considered for this work contains the following amounts of data: class 0, n=7859 (35.781%); class 1, n=3541 (16.122%); class 2, n=8014 (36.487%); class 3, n=2550 (11.610%). Since this work aimed to predict the lyrics' emotion labels, the subset was used with the LyBERT model as a whole, keeping the high-count classes (happy and sad) and the low-count classes (angry and relaxed) as they are.
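The class shares quoted above follow directly from the per-class counts; a quick check:

```python
# Per-class sample counts from the subset described above.
counts = {"happy": 7859, "angry": 3541, "sad": 8014, "relaxed": 2550}
total = sum(counts.values())  # 21964 samples in the subset

# Percentage share of each class, rounded to three decimals.
shares = {k: round(100 * v / total, 3) for k, v in counts.items()}
```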