Collaborative filtering recommends items to a user based on the preferences of other users with similar tastes. It rests on the assumption that people tend to form their opinions from the reviews and opinions of similar groups. Compared with traditional recommendation methods, graph neural network recommendation methods not only learn the topology of the graph but also aggregate information from neighboring nodes, so they capture the information in the graph more effectively and provide a stronger basis for the subsequent recommendation. How videos and users are described and represented is the basis of video recommendation and determines the performance of the recommender system. To address this representation problem, this paper proposes a recommendation model based on a multi-graph neural network. The model exploits the ability of graph neural networks to mine deep information from graph data, and it transforms the input user rating information and item side information into multiple graphs for effective feature extraction. From the perspective of multimodal semantics, a video similarity metric learning method based on semantic information is also proposed. Videos contain several media types, and each media type has its own features. By analyzing the features of each media type and representing the multimodal features in a unified way, we lay a foundation for video recommendation.
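The collaborative filtering assumption described above can be illustrated with a minimal sketch of user-based filtering. This is not the multi-graph model proposed in this paper. It is a generic baseline in which the rating data, the user and video names, and the helper functions `cosine_similarity` and `predict_rating` are all hypothetical, and cosine similarity over co-rated items is one assumed choice of similarity measure among several.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine similarity computed over the items both users have rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = sqrt(sum(a[i] ** 2 for i in common))
    norm_b = sqrt(sum(b[i] ** 2 for i in common))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def predict_rating(ratings, user, item):
    """Similarity-weighted average of other users' ratings for `item`.

    Returns None when no similar user has rated the item.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        if sim <= 0:
            continue  # ignore users with no positive similarity
        weighted_sum += sim * other_ratings[item]
        weight_total += sim
    if weight_total == 0:
        return None
    return weighted_sum / weight_total


# Hypothetical toy data: user -> {video id -> rating}.
ratings = {
    "alice": {"v1": 5.0, "v2": 1.0},
    "bob": {"v1": 5.0, "v2": 1.0, "v3": 4.0},
    "carol": {"v1": 1.0, "v2": 5.0, "v3": 2.0},
}

prediction = predict_rating(ratings, "alice", "v3")
print(prediction)
```

In this toy data, "bob" rates the shared videos exactly as "alice" does, so his rating of "v3" dominates the prediction, while the less similar "carol" contributes with a smaller weight.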