With the continuous development of deep learning, video classification methods based on deep learning models have made remarkable progress; numerous models, including two-stream convolutional neural networks and 3D convolutional neural networks, have been proposed and have gradually become the mainstream approaches. For video representation, especially of web videos, and for the applications built on it, multimodal strategies have been widely adopted by researchers. Nikhil et al. used canonical correlation analysis to study text-image correlation in cross-modal retrieval over text and multimedia documents: the text and image parts of a document are modeled jointly and abstracted at the semantic level, and their experiments demonstrate improved accuracy of retrieval matching. Short videos contain rich multimodal information, and fusing information from multiple modalities can improve the accuracy of the video classification task.

In this paper, a new combinatorial network model is proposed. The model first combines the discrete features of each modality into an overall feature for that modality through a network; the overall feature of the video is then obtained by fusing the features of the various modalities and is used for classification. To verify the effectiveness of the proposed algorithm, single-modal experiments and multimodal fusion experiments are conducted on the relevant dataset. The results show that the proposed network fusion technique can be effectively applied to video classification tasks.
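To make the described two-stage pipeline concrete, the following is a minimal PyTorch sketch, not the authors' actual implementation: each modality's discrete features are first combined into one modality-level feature, and the modality-level features are then fused into an overall video feature for classification. The module names, feature dimensions, mean-pooling within a modality, and concatenation across modalities are all illustrative assumptions.

```python
# Illustrative sketch of the two-stage fusion described above.
# Stage 1: combine the discrete features of one modality into a single
#          modality-level feature (mean-pooling assumed).
# Stage 2: fuse the modality-level features into an overall video feature
#          (concatenation assumed) and classify it.
import torch
import torch.nn as nn

class ModalityEncoder(nn.Module):
    """Combines a sequence of discrete features (e.g. per-frame or per-token
    vectors) of one modality into one overall feature for that modality."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_segments, in_dim) -> (batch, out_dim)
        return self.proj(x).mean(dim=1)  # pool over the discrete features

class FusionClassifier(nn.Module):
    """Fuses the per-modality features into an overall video feature and
    maps it to class logits."""
    def __init__(self, modality_dims: dict, hidden: int, num_classes: int):
        super().__init__()
        self.encoders = nn.ModuleDict(
            {name: ModalityEncoder(d, hidden) for name, d in modality_dims.items()}
        )
        self.classifier = nn.Linear(hidden * len(modality_dims), num_classes)

    def forward(self, inputs: dict) -> torch.Tensor:
        # Encode each modality separately, then fuse by concatenation.
        feats = [self.encoders[name](x) for name, x in inputs.items()]
        return self.classifier(torch.cat(feats, dim=-1))

# Usage with hypothetical visual, audio, and text features of a short video.
model = FusionClassifier({"visual": 2048, "audio": 128, "text": 768},
                         hidden=256, num_classes=10)
batch = {"visual": torch.randn(4, 16, 2048),   # 16 per-frame features
         "audio": torch.randn(4, 8, 128),      # 8 audio-segment features
         "text": torch.randn(4, 32, 768)}      # 32 token features
logits = model(batch)                          # shape: (4, 10)
```

Mean-pooling and concatenation are only the simplest choices for the two stages; the combination and fusion networks in the paper could equally be recurrent, attention-based, or learned weighted sums without changing the overall structure sketched here.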