4.1 Dataset
The proposed method needs music files to extract the chorus parts and compute the mel spectrograms. Since no existing dataset provided both chorus parts and Valence/Arousal labels, we built a dataset that fits our needs using DMDD and MSDEN. DMDD provides Valence/Arousal labels for 18,647 samples from the Million Song Dataset (MSD) [33]. MSDEN maps MSD song IDs to the IDs of other services, such as Spotify.
To build our dataset, we first mapped MSD IDs to Spotify IDs using MSDEN. We then used spotDL[1] to download the audio files of the mapped songs. spotDL is an open-source tool that, given a Spotify URL, searches YouTube for a matching track and, if one is found, downloads it along with the album art, lyrics, and metadata. Finally, the chorus parts are extracted with Pychorus, and the log-mel spectrograms are computed with librosa using the configuration stated in Section 3.1.
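The following is a minimal sketch of this extraction step, assuming the audio files have already been downloaded (e.g., via spotDL). The sample rate, number of mel bands, and hop length are placeholder values, not the actual configuration from Section 3.1, and the function name and file paths are hypothetical.

```python
# Sketch of the chorus extraction and log-mel spectrogram step.
# The spectrogram parameters below are placeholders, not the
# Section 3.1 settings.
import librosa
import numpy as np
from pychorus import find_and_output_chorus

def extract_features(audio_path, chorus_path, clip_length=15,
                     sr=22050, n_mels=128, hop_length=512):
    # Locate the chorus and write a 15-second excerpt to chorus_path.
    # Pychorus returns None when no chorus is found; such tracks
    # were dropped from the dataset.
    chorus_start = find_and_output_chorus(audio_path, chorus_path, clip_length)
    if chorus_start is None:
        return None

    def log_mel(path):
        y, _ = librosa.load(path, sr=sr, duration=clip_length)
        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels,
                                             hop_length=hop_length)
        return librosa.power_to_db(mel, ref=np.max)

    # Log-mel spectrograms of the chorus excerpt and the first 15 seconds.
    return log_mel(chorus_path), log_mel(audio_path)
```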
Because some MSD IDs were missing from the Spotify ID mapping, and some Spotify tracks were unavailable on YouTube, we could not obtain all 18,647 samples in DMDD. In addition, Pychorus fails to detect a chorus in some tracks.
The above process resulted in a dataset of 10,438 samples. Due to rights restrictions, the music files are not included in the dataset. We released the extracted log-mel spectrograms (for both the chorus part and the first 15 seconds of each track), the labels, MSD IDs, Spotify IDs, artists, track names, and the Python script we used to download the music files[2]. DMDD already defines train, validation, and test splits, and we preserved each sample's original split. Table 2 lists the number of samples in each set. Figure 4 shows the distribution of the data based on the labels.
Table 2: Dataset details
| Set | Number of Samples |
| --- | --- |
| Train | 6213 |
| Test | 2026 |
| Validation | 2199 |
4.2 Experiment details
To run the experiments and analyze our method, the proposed approach was implemented as a Python script. The configuration of the experimental environment is listed in Table 3.
Table 3: Experimental environment configuration
| Parameter | Value |
| --- | --- |
| Operating system | Windows 11 |
| CPU | Intel® Core™ i7-12700H |
| GPU | NVIDIA GeForce RTX 3060 (Laptop, 140W) |
| Python | 3.10.6 |
| PyTorch | 2.0.1 |
| PyTorch Compute Platform | CUDA 11.7 |
| Pychorus | 0.1 |
| Librosa | 0.10.0 |
| spotDL | 4.1.11 |
Since the target labels are dimensional, we chose the R2 score and root-mean-squared error (RMSE) as evaluation metrics. The R2 score has a maximum of 1, with higher values being better (it can become negative when a model fits worse than predicting the mean); a smaller RMSE, on the other hand, is better. The model was trained with a batch size of 32 samples for up to 100 epochs, using the Stochastic Gradient Descent (SGD) optimizer with a learning rate of 0.01 and MSE as the loss function. To prevent over-fitting, early stopping and L2 regularization are used, with rates of 0.00001 and 0.0001, respectively.
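A minimal sketch of this training setup in PyTorch is shown below. The model and data loaders are assumed to already exist, the early-stopping patience is an assumption (the text does not specify one), and we interpret the 0.00001 rate as the minimum-improvement threshold for early stopping.

```python
# Sketch of the training setup under the stated hyper-parameters.
# `model`, `train_loader`, and `val_loader` are assumed to exist.
import torch
from sklearn.metrics import r2_score

criterion = torch.nn.MSELoss()
# weight_decay implements the L2 regularization (rate 0.0001).
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

best_val, min_delta = float("inf"), 1e-5
patience, wait = 10, 0                     # patience is an assumption
for epoch in range(100):
    model.train()
    for x, y in train_loader:              # batches of 32 samples
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()

    # Validation: MSE loss plus the reported metrics (RMSE, R2).
    model.eval()
    preds, targets = [], []
    with torch.no_grad():
        for x, y in val_loader:
            preds.append(model(x))
            targets.append(y)
    preds, targets = torch.cat(preds), torch.cat(targets)
    val_mse = criterion(preds, targets).item()
    rmse = val_mse ** 0.5
    r2 = r2_score(targets.cpu().numpy(), preds.cpu().numpy())

    # Early stopping on validation loss (min-delta 0.00001 assumed).
    if best_val - val_mse > min_delta:
        best_val, wait = val_mse, 0
    else:
        wait += 1
        if wait >= patience:
            break
```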
4.3 Experiment results
To verify the advantage of our proposed method, we compared it with several state-of-the-art deep architectures and baselines. Because some works [18, 24] propose multi-modal methods that also use textual data (lyrics), they are not directly comparable with ours; we therefore ignored the textual data and compared against the audio-based parts of those works.
The benchmark metrics were computed separately on the chorus part and the first 15 seconds of each track to show the effect of using chorus parts as input. The results are summarized in Tables 4 and 5. Training times for the deep methods are listed separately for CPU and GPU. As the tables show, our proposed model has a clear advantage over the others, achieving lower RMSE and higher R2 scores.
Table 4: Performance comparison with baselines on the chorus part (15 seconds) and the music's first 15 seconds
| Model | Input | MSE | R2 (V) | R2 (A) | R2 (Overall) | Training Time (sec) |
| --- | --- | --- | --- | --- | --- | --- |
| SVR | First 15 sec | 0.1087 | -0.0151 | 0.0668 | 0.0258 | 552 (CPU) |
| SVR | Chorus part | 0.1049 | 0.0118 | 0.1097 | 0.0608 | 573 (CPU) |
| Decision Tree Regressor | First 15 sec | 0.1062 | 0.0097 | 0.0874 | 0.0485 | 1 (CPU) |
| Decision Tree Regressor | Chorus part | 0.1058 | 0.0004 | 0.1069 | 0.0537 | 1 (CPU) |
| KNN | First 15 sec | 0.1042 | 0.0178 | 0.1173 | 0.0675 | 1 (CPU) |
| KNN | Chorus part | 0.1032 | 0.0124 | 0.1442 | 0.0783 | 1 (CPU) |
| Elastic-net LR | First 15 sec | 0.1032 | 0.0276 | 0.1253 | 0.0765 | 40 (CPU) |
| Elastic-net LR | Chorus part | 0.1029 | 0.0198 | 0.1411 | 0.0805 | 39 (CPU) |
| Gradient Boosting Regressor | First 15 sec | 0.1060 | 0.0186 | 0.0799 | 0.0493 | 544 (CPU) |
| Gradient Boosting Regressor | Chorus part | 0.1052 | 0.0157 | 0.1005 | 0.0581 | 458 (CPU) |
| AdaBoost Regressor | First 15 sec | 0.1059 | 0.0155 | 0.0868 | 0.0511 | 515 (CPU) |
| AdaBoost Regressor | Chorus part | 0.1047 | 0.0079 | 0.1182 | 0.0630 | 515 (CPU) |
| Our Model | First 15 sec | 0.0992 | 0.0805 | 0.1402 | 0.1103 | 287 (CPU) |
| Our Model | Chorus part | 0.0957 | 0.1207 | 0.1618 | 0.1412 | 251 (CPU) |
Table 5: Performance comparison with state-of-the-art methods on the chorus part (15 seconds) and the music's first 15 seconds
| Model | Input | MSE | R2 (V) | R2 (A) | R2 (Overall) | Training Time (sec) |
| --- | --- | --- | --- | --- | --- | --- |
| Delbouys et al. (2018) (Audio Net Only) [24] | First 15 sec | 0.1040 | 0.0300 | 0.1067 | 0.0684 | 49 (GPU) / 88 (CPU) |
| Delbouys et al. (2018) (Audio Net Only) [24] | Chorus part | 0.1026 | 0.0546 | 0.1046 | 0.0796 | 38 (GPU) / 83 (CPU) |
| Chaudhary et al. (2020) [22] | First 15 sec | 0.1016 | 0.0695 | 0.1054 | 0.0874 | 122 (GPU) / 241 (CPU) |
| Chaudhary et al. (2020) [22] | Chorus part | 0.1014 | 0.1014 | 0.0494 | 0.0841 | 121 (GPU) / 238 (CPU) |
| Pyrovolakis et al. (2022) (Audio Net Only) [18] | First 15 sec | 0.1048 | 0.0067 | 0.1186 | 0.06269 | 1650 (GPU) / 3135 (CPU) |
| Pyrovolakis et al. (2022) (Audio Net Only) [18] | Chorus part | 0.1043 | 0.0089 | 0.1260 | 0.0674 | 1797 (GPU) / 5382 (CPU) |
| Han et al. (2023) [23] | First 15 sec | 0.1087 | 0.0213 | 0.0222 | 0.0217 | 114 (GPU) / 823 (CPU) |
| Han et al. (2023) [23] | Chorus part | 0.1036 | 0.0442 | 0.0958 | 0.0700 | 112 (GPU) / 820 (CPU) |
| Our Model | First 15 sec | 0.0992 | 0.0805 | 0.1402 | 0.1103 | 70 (GPU) / 287 (CPU) |
| Our Model | Chorus part | 0.0957 | 0.1207 | 0.1618 | 0.1412 | 60 (GPU) / 251 (CPU) |
Our model performed best when the chorus parts were used as input. The results also show that using chorus parts can improve the other methods. For arousal, we observed a 15% relative improvement in the R2 score of the best models, and for valence a 40% relative improvement; the valence regressor thus benefited more than the arousal regressor. The proposed model also trains faster than most of the listed methods. Figures 5 and 6 show the R2 score and loss (MSE) curves for the training and validation data.