Network video typically contains diverse information that video captioning models use to generate descriptions. Captioning proceeds in two steps: video information extraction and natural language generation. Existing models suffer from redundant information across consecutive frames during language generation, which degrades caption accuracy. This paper therefore proposes a Multimodal Semantic Grouping and Semantic Attention Video Caption Model (VMSG). Unlike frame-by-frame decoding, VMSG uses a novel semantic grouping method: frames sharing the same semantics are merged into a semantic group, and decoding predicts the next word from these groups, reducing the redundancy of consecutive video frames. Because semantic groups differ in importance, we introduce a semantic attention mechanism that assigns a weight to each group, and we use a single-layer LSTM to keep the model simple. Experiments show that VMSG outperforms several state-of-the-art models in caption generation performance and alleviates the problem of redundant information in consecutive video frames.
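The grouping-then-attention idea described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's actual implementation: the function names, the mean-pooling of frames within a group, and the bilinear scoring against the decoder state are all assumptions made for clarity.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-d score vector
    e = np.exp(x - x.max())
    return e / e.sum()

def group_frames_by_semantics(frame_feats, labels):
    """Pool consecutive frames that share a semantic label into one
    group feature, reducing redundancy across similar frames.
    (Mean-pooling is an assumed choice, not the paper's method.)"""
    groups = []
    for g in sorted(set(labels)):
        members = frame_feats[labels == g]
        groups.append(members.mean(axis=0))
    return np.stack(groups)

def semantic_attention(group_feats, decoder_state, W):
    """Weight each semantic group by its relevance to the current
    decoder state, then form a weighted context vector."""
    scores = group_feats @ W @ decoder_state   # one score per group
    weights = softmax(scores)                  # importance of each group
    context = weights @ group_feats            # weighted sum of groups
    return context, weights

rng = np.random.default_rng(0)
frames = rng.normal(size=(6, 4))        # 6 frames, 4-d features
labels = np.array([0, 0, 1, 1, 1, 2])   # 3 semantic groups
groups = group_frames_by_semantics(frames, labels)
ctx, w = semantic_attention(groups, rng.normal(size=4), rng.normal(size=(4, 4)))
print(groups.shape, w.shape, ctx.shape)  # (3, 4) (3,) (4,)
```

The context vector `ctx` would then be fed, together with the previous word, into the single-layer LSTM decoder at each time step.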