An effective recommendation system not only increases user stickiness; the traffic brought by user conversion and retention can also significantly increase a video platform's revenue. Current recommendation systems support recommendation across text, images, audio, and video: media files are pre-tagged, and the content a user requests is then retrieved through a search engine. However, as video platforms develop, users' demand for personalization grows stronger, recommendation scenarios become more complex, and the accuracy required of video recommendations keeps rising. Compared with the structured data in traditional databases, multimedia data is characterized by high dimensionality, rich content, and large storage requirements. Multimedia data also contains information in multiple modalities, i.e., text, image, and audio. Although some works analyze the content of videos, they consider only a single modality of the video data. In this paper, we explore how to use multimodal techniques and knowledge graphs to enhance user representations, and we try to extend the idea of multimodal feature learning to existing knowledge-graph-based recommendation algorithms.