Video Background Music Recommendation Based on Multi-Level Fusion Features

People resonate more with music when exposed to visual information, and music enhances their perception of video content. Cross-modal recommendation techniques can be used to suggest appropriate background music for a given video. However, the correspondence between the two modalities is not a simple one-to-one mapping. Therefore, to explore the association between video and music, we propose MFF-VBMR, a video background music recommendation model based on multi-level fusion features. The model uses cross-modal information drawn from the static, dynamic and emotional content of video and music to match and recommend suitable background music for a given video. Experimental results show that the proposed model outperforms existing models and achieves satisfactory results for video background music recommendation.


I. INTRODUCTION
Video and music are two kinds of media that receive wide attention on the Internet, and people's perception of them is highly correlated. The stimulation of visual information resonates with people when they listen to music, while music echoes the video in visual perception and adds color to the scene.
Related studies [1] suggest that human perception of music may have evolved from an ancient skill: the ability to interpret emotions from movements. It has been found that when people pair an emotion with a melody or a video animation, they choose combinations that share the same temporal and spatial characteristics, such as the same tempo, rhythm and smoothness.
Thus, video and music are not simply in one-to-one correspondence; rather, many correlations between them make people react in similar, often unconscious ways. Although both are widely used media, the connections between them have not been well explored. Some studies have focused on analyzing the underlying semantic features of the video and audio modalities [2], [3]. Such single features are not tightly connected enough to capture the key information of video and music, so the multimodal features remain loosely coupled in feature space and retrieval results in cross-modal retrieval and recommendation tasks are poor. Meanwhile, most existing cross-modal recommendation methods construct a common subspace [5]-[7] and rely on the mathematical relationship between the two feature vectors to capture similarity, which tends to make the multimodal information overly homogeneous and degrades model performance.
Therefore, to explore the association between the video and music modalities, as well as the commonality in people's perception of multimodal information, this paper proposes a video background music recommendation method based on multi-level fusion features. It exploits cross-modal information from the static, dynamic and emotional content of video and music to match and recommend appropriate background music for a given video.
Specifically, the contributions of this paper are threefold.
• We propose MFF-VBMR, a video background music recommendation model based on multi-level fusion features. The model processes the preprocessed multimodal feature data synchronously and finds the key pairs of multimodal information using a correlation framework for video and music.
• In feature selection, we summarize the similarities and differences between audio and video features and extract features more comprehensively, at the static, dynamic and emotional levels of both modalities, so that they better represent the deep content of audio and video.
• In the retrieval task, we modify the CNN pipeline to make cross-modal information matching more accurate and propose a feature normalized convolutional similarity algorithm (FNC). To give each feature vector an equal contribution to the similarity calculation, the extracted feature vectors are L2-normalized. A self-attention mechanism is also introduced to weight the captured audio and video feature vectors.

II. RELATED WORK
Cross-modal retrieval has been extensively studied in recent years. Zeng et al. [19] obtained better representations by constructing a deep triplet neural network with triplet loss to learn the projection that maximizes correlation in the shared subspace. Wang et al. [20] proposed a novel two-branch fine-grained cross-media network and designed an efficient range-metric scheme well suited to fine-grained retrieval scenarios. To process information efficiently, Liu et al. [21] used information entropy to extract key frames, reducing the computational burden of per-frame feature calculation and comparison and thereby shortening key-frame detection time. Kaur et al. [22] proposed a framework for unsupervised cross-modal retrieval based on associative learning, in which the modalities are correlated using Hebbian learning networks to facilitate the retrieval process. Yet individual features lack sufficient depth, and the mapping relations of common-subspace networks are somewhat divergent.
On the study of background music recommendation, Kuo et al. [13] proposed a framework to discover associations between sentiment and music features for music recommendation; they investigated the extraction of musical features and suggested the use of affinity graphs to discover associations. Yu et al. [14] used location information from UGV to map tags to emotion tags by investigating categories on the website and then comparing them to music emotions, but did not consider the visual content of the videos. Later, Wang et al. [23] analyzed the emotional connection and similarity between video and music from the perspective of audiovisual synesthesia. To address the limitations of single features and similarity algorithms, we use multimodal fusion information at different levels and build a similarity learning network to calculate the similarity between video and music.

III. METHOD
This paper proposes a video background music recommendation model (MFF-VBMR) based on multi-level fused features. First, we extract features that capture the deep-level correspondence between video and music data. Second, we fuse the multimodal features. Finally, we feed the fused multimodal features of video and music into our proposed feature normalized convolutional similarity network to obtain the optimal solution for music recommendation. The model framework is shown in Figure 1.

A. Multimodal feature analysis
When people watch a video with music playing in the background, they relate the video and the music to each other, and this feeling of similarity is present not only in the video but also in the music. To match the recommended soundtrack to the video, we explore the cross-modal alignment between video and music at three levels of analysis: static, dynamic and emotional. The features are summarized in Table 1, and video and music features are extracted from the categories described there.

B. Feature extraction
Video feature extraction: In this paper, the Inception network is used to extract static features of video key frames, and the output is a feature vector for each key frame. We use optical flow as the dynamic feature of video. The optical flow $f_t(x, y) \in \mathbb{R}^{H \times W \times 2}$ measures the per-pixel displacement between two consecutive frames $I_t, I_{t+1} \in \mathbb{R}^{H \times W \times 3}$. Analogous to distance and velocity, we define the optical flow amplitude $F_t$ as the average of the absolute optical flow, which measures the magnitude of motion in frame $t$.
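As a minimal sketch of how the optical flow amplitude $F_t$ could be computed, the fragment below uses OpenCV's Farneback estimator; the paper does not name a specific flow implementation, so the estimator, its parameters and the frame format are assumptions.

import cv2
import numpy as np

def optical_flow_amplitude(frames):
    """Average optical flow magnitude between consecutive frames.

    frames: list of H x W x 3 uint8 BGR images (e.g. extracted key frames).
    Returns one amplitude value F_t per consecutive frame pair.
    """
    amplitudes = []
    prev = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
    for frame in frames[1:]:
        curr = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Dense flow f_t(x, y) in R^{H x W x 2}: per-pixel displacement.
        flow = cv2.calcOpticalFlowFarneback(
            prev, curr, None, 0.5, 3, 15, 3, 5, 1.2, 0)
        # F_t: mean magnitude of the displacement field.
        amplitudes.append(np.linalg.norm(flow, axis=2).mean())
        prev = curr
    return amplitudes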
The video is represented by T extracted key frames, and the motion relationship between the key frames must be calculated. The motion saliency at frame t is computed as the average positive change in optical flow in all directions between two consecutive frames; saliency takes a larger value when there is a sudden visible change in the key frame. The corresponding optical flow feature vector is finally obtained.
Music feature extraction: We use the audio feature extraction tool librosa [15] to extract static and dynamic features from music. We compute the mean and variance of each frame-level feature and its top-K order statistics, and concatenate all the static-level features. librosa can also extract the tempo and beats of the music and estimate tempo and tuning. We use openSMILE to extract emotional features of music. openSMILE provides a variety of standard feature sets for emotion recognition; we use "emobase2010" with some adjustments to the normalization of duration and location features.
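As an illustrative sketch of the static and dynamic music features described above, the fragment below uses librosa; the exact frame-level descriptors (MFCC and chroma here) and the form of the top-K statistic are assumptions, since the paper does not enumerate them.

import librosa
import numpy as np

def music_static_dynamic_features(path, top_k=3):
    y, sr = librosa.load(path)
    # Frame-level descriptors (the exact set is an assumption).
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)
    frames = np.vstack([mfcc, chroma])
    # Mean, variance and the top-K order statistics of each frame-level feature.
    top = np.sort(frames, axis=1)[:, -top_k:]
    static = np.concatenate(
        [frames.mean(axis=1), frames.var(axis=1), top.ravel()])
    # Dynamic level: estimated global tempo and number of detected beats.
    tempo, beats = librosa.beat.beat_track(y=y, sr=sr)
    dynamic = np.array([float(tempo), float(len(beats))])
    return static, dynamic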
C. FNC Network
1) Similarity calculation: We propose the feature normalized convolutional similarity algorithm (FNC), a two-part network that processes representative video-music features and computes the similarity between video-music pairs.
Since the result of a convolution in a CNN can be viewed as the inner product of vectors formed by the convolution kernel and the convolution region, and the inner product measures the similarity of two vectors, the model can learn similarity patterns in a CNN subnetwork that operates on the similarity matrix between pairs of feature vectors. These similarity matrices, containing all pairs of feature vectors, are fed to the CNN to train the video-music similarity model.
To make the contribution of each feature vector equal, L2 normalization is applied to the extracted feature vectors. This also balances the differing numbers of music and video features and alleviates overcrowding in the feature space. However, different key frames or scenes contribute differently to a video, just as sounded and silent clips contribute differently to a piece of music. We therefore use an attention mechanism to weight the feature vectors of video and music.
The following attention mechanism is constructed in this paper. Let the feature vectors of video and music be $V_{i,j}$ and $M_{i,k}$ respectively, where $i \in \{1, 2, 3\}$ indexes the feature level and $j \in [1, X]$, $k \in [1, Y]$. We introduce a contextual unit vector $u$ and use it to measure the importance of each region vector: the weight scores $a_{ij}$ and $b_{ik}$ are obtained by computing the dot product between each region vector $h$ (a $V_{i,j}$ or $M_{i,k}$) and the context vector $u$, where $s(x, y) = x^{\mathsf{T}} y$ is the dot-product model. Since all vectors have unit norm, the resulting weight scores lie in the range $[-1, 1]$.
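A minimal PyTorch sketch of this weighting, assuming the region vectors of one modality and level are stacked into a matrix; the function name and shapes are illustrative.

import torch
import torch.nn.functional as F

def weight_regions(h, u):
    """Weight L2-normalized region vectors by a context unit vector u.

    h: (N, D) region vectors of one modality/level (the V_{i,j} or M_{i,k}).
    u: (D,)   contextual unit vector.
    Returns the weighted vectors and the scores, which lie in [-1, 1].
    """
    h = F.normalize(h, p=2, dim=-1)  # unit norm: equal contribution per vector
    u = F.normalize(u, p=2, dim=-1)
    scores = h @ u                   # dot-product score s(h, u) = h^T u
    return h * scores.unsqueeze(-1), scores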
Then, we concatenate the processed video and music feature vectors into groups by feature level, obtaining the video feature vector group $p_j$ and the music feature vector group $q_k$. A dot-product calculation yields the feature matrix $S_{pq} \in \mathbb{R}^{X \times Y}$ with entries $S_{pq}[j, k] = p_j^{\mathsf{T}} q_k$.
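Under the same assumptions, with the groups stacked as matrices p of shape (X, D) and q of shape (Y, D), the feature matrix is a single matrix product:

# p: (X, D) video feature group, q: (Y, D) music feature group,
# already normalized and attention-weighted as above.
S_pq = p @ q.T   # S_pq[j, k] = p_j^T q_k, shape (X, Y)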
The generated feature matrix $S_{pq}$ is then fed into a four-layer convolutional network that can capture segment-level temporal patterns of video-music similarity. The convolutional layer setup is shown in Figure 2. To compute the final similarity of a video-music pair, the output of the convolutional network is passed through a Hard tanh activation function, and the similarity score $F$ is obtained by mean-max filtering of the activated output matrix $O$:

$$F = \frac{1}{X} \sum_{j=1}^{X} \max_{k} O_{jk}.$$

2) Loss function: The similarity score $F$ should be higher for music relevant to the video and lower for irrelevant music. We use the input video as the anchor, music similar to the video as the positive music $m^{+}$, and music that differs too much from the video as the negative music $m^{-}$. The loss function is

$$\mathcal{L}_{tri} = \max\left(0, \ \alpha - F(v, m^{+}) + F(v, m^{-})\right),$$

where $v$ is the anchor video and $\alpha$ is the margin parameter.
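A minimal PyTorch sketch of the convolutional similarity head just described; the channel width and kernel sizes are assumptions (the actual layer setup is given in Figure 2), and mean-max filtering is read as the mean over video segments of the per-segment maximum.

import torch
import torch.nn as nn

class FNCHead(nn.Module):
    """Four-layer convolutional similarity head (layer sizes assumed)."""

    def __init__(self, channels=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, 1, 3, padding=1),
        )

    def forward(self, S_pq):
        # S_pq: (B, X, Y) similarity matrices of feature-vector pairs.
        raw = self.net(S_pq.unsqueeze(1)).squeeze(1)  # pre-activation output
        out = torch.nn.functional.hardtanh(raw)       # clipped to [-1, 1]
        # Mean-max filtering: best music segment per video segment, averaged.
        score = out.max(dim=2).values.mean(dim=1)     # similarity F per pair
        return score, raw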
In addition, we introduce a similarity regularization loss, which provides a significant performance improvement. The range of the Hard tanh activation function is $[-1, 1]$, and this loss drives the network to generate output matrices within that range: the regularization loss is the sum, over all output similarity matrices, of the values falling outside $[-1, 1]$.
The final loss function is defined as

$$\mathcal{L} = \mathcal{L}_{tri} + \beta \mathcal{L}_{reg},$$

where $\beta$ is the regularization hyperparameter that adjusts the contribution of the similarity regularization to the total loss.
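A sketch combining the triplet and regularization terms, using the $\alpha$ and $\beta$ values reported later in the experiments; the function and argument names are illustrative, and the regularization is applied to the pre-activation output as one reading of the description above.

import torch

def total_loss(f_pos, f_neg, raw_out, alpha=0.5, beta=0.1):
    """Triplet margin loss plus similarity regularization (a sketch).

    f_pos, f_neg: similarity scores F for the (anchor, positive) and
                  (anchor, negative) video-music pairs.
    raw_out: pre-activation output matrices of the convolutional network.
    """
    triplet = torch.clamp(alpha - f_pos + f_neg, min=0).mean()
    # Sum of magnitudes of output values falling outside Hardtanh's [-1, 1].
    reg = torch.clamp(raw_out.abs() - 1.0, min=0).sum()
    return triplet + beta * reg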

IV. EXPERIMENT
A. Datasets and parameters
HIMV-200K [6] is a dataset of 200,500 video-music pairs. These pairs were obtained from YouTube-8M, a large-scale labeled video dataset consisting of millions of YouTube video IDs and associated labels.
The Pop music videos dataset [16] contains pop music videos with a large variety of camera angles, shots and movements, which helps the model learn the relationship between videos and their corresponding music.
We downloaded the required video and music data from Douyin and PMEmo [49], respectively, to construct a self-built dataset. The music data contains 1,794 songs with sentiment annotations and a corresponding collection of videos with similar category labels.
We organize the data into video-music triplets. Due to GPU memory limitations, we can supply the network with only one video-music triplet at a time during training. We used Adam optimization with the learning rate set to $1 \times 10^{-5}$.
For each epoch, T = 1000 triplets are selected for each pool. The model was trained for 100 epochs, i.e. 200K iterations, and the best network was selected based on the mean average precision (mAP) on the validation set. The other parameters were set to $\alpha = 0.5$, $\beta = 0.1$ and $W = 64$, and the weights of the feature extraction CNN and the whitening layer were kept constant.
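A compact sketch of this optimization setup, reusing the FNCHead and total_loss sketches above; sample_triplets and similarity_matrix are hypothetical helpers standing in for the triplet sampler and the feature pipeline, which the paper does not spell out as code.

import torch

model = FNCHead()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
for epoch in range(100):  # 100 epochs (200K iterations in the paper's setup)
    for v, m_pos, m_neg in sample_triplets(pool, T=1000):  # hypothetical sampler
        f_pos, raw_p = model(similarity_matrix(v, m_pos))  # hypothetical helper
        f_neg, raw_n = model(similarity_matrix(v, m_neg))
        loss = total_loss(f_pos, f_neg, torch.cat([raw_p, raw_n]),
                          alpha=0.5, beta=0.1)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()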
Following prior work, we selected recall, precision and mean average precision as evaluation criteria.
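For concreteness, Recall@K over a retrieval pool could be computed as below; the convention that the ground-truth track of query i sits at column i is an assumption about how the pool is arranged.

import numpy as np

def recall_at_k(scores, k):
    """scores: (N, M) similarity of each query video to M candidate tracks;
    the ground-truth music for query i is assumed to be column i."""
    top_k = np.argsort(-scores, axis=1)[:, :k]  # indices of the K best tracks
    hits = (top_k == np.arange(scores.shape[0])[:, None]).any(axis=1)
    return hits.mean()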

1) Ablation experiments:
To evaluate the performance of MFF-VBMR and the influence of each feature level on its recommendations, we conducted ablation experiments on the HIMV-200K dataset, the Pop music videos dataset and the self-built dataset. We verified three configurations: the first uses the static, dynamic and emotional features of video and music separately for the recommendation task; the second uses the full method proposed in this paper; the third omits feature normalization and the attention network. The experimental results are shown in Tables 2-4, where SF, DF and EF denote using only static, dynamic and emotional features respectively, and AT denotes the attention network.
The experimental results show that, regardless of the dataset, the recall and mean average precision of the model using multi-level fusion features were higher than those using only single-level features, and our proposed model performed well on the background music recommendation task. This is because single-level features capture only local aspects of the complex relationship between visual and auditory elements; they miss the global contextual information that is often essential and must be complemented across feature levels. The experiments also show that even without L2 normalization and the attention mechanism, the model's performance improves somewhat, because the multi-level features draw video and music closer together; performance improves further when L2 normalization and the attention mechanism are added. This feature processing is equivalent to imposing a hard constraint on the two branch features of video and music, which increases the recognition accuracy of the model.
2) Comparison experiments: The expected values in the tables are the theoretical values of R@K on each test dataset and are used to judge whether a method achieves a passing score.
The experimental results are shown in Tables 5-7. The Recall@K values all exceed the expected values, indicating that these methods achieve the desired goals and perform well on cross-modal retrieval and recommendation tasks. Compared with other traditional methods, the model designed in this paper has superior performance.
The PR curves in Figures 3-5 visualize the performance of the six methods on the three datasets, from which the performance differences between the models can be seen clearly and intuitively. Our model achieves good results on all three datasets.

V. CONCLUSION
In this paper, we propose a video background music recommendation method based on multi-level fusion features and design a convolutional similarity calculation network. The approach exploits the multimodal information of video and music to match appropriate background music to a given video. Experimental results show that the proposed model improves recommendation performance and achieves higher accuracy. The method offers a useful reference for future cross-modal recommendation applications. In future work, we will investigate further cross-modal recommendation scenarios and consider improving the algorithm's computational efficiency on large amounts of data.