Utilising a cross-modal attention mechanism for multimodal feature integration may engender confounding effects, resulting in detrimental biases during modal interaction and consequently impacting the outcomes of emotion classification. To address this issue, a cross-modal fusion network based on a causal gating attention mechanism is proposed. First, a feature-masking text embedding module is utilised to enhance the semantic representation capability of both the audio and video modalities. Subsequently, a cross-modal attention fusion module is employed to complementarily merge the audio and video modalities, obtaining fused audio-video modality features. Next, a causal gating cross-modal fusion network is used to fully integrate the heterogeneous data of the text, audio, and video modalities. Finally, the sentiment analysis results are classified using SoftMax. The proposed cross-modal fusion network demonstrated superior performance in sentiment classification compared to baseline techniques on the CMU-MOSEI dataset, effectively associating and combining pertinent multi-modal information.
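To make the described pipeline concrete, the following is a minimal sketch of the general idea: cross-modal attention in which one modality queries another, followed by a learned gate that scales the fused audio-video features before classification. The module name, dimensions, gating form, and number of classes are illustrative assumptions, not the paper's exact specification.

```python
import torch
import torch.nn as nn


class GatedCrossModalFusion(nn.Module):
    """Hypothetical sketch: cross-modal attention fusion with a gated combination."""

    def __init__(self, dim: int = 256, num_heads: int = 4, num_classes: int = 7):
        super().__init__()
        # Audio features attend to video features; a symmetric branch could be added.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Gate computed from the concatenated text and audio-video representations.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, text: torch.Tensor, audio: torch.Tensor, video: torch.Tensor):
        # text, audio, video: (batch, seq_len, dim), already projected to a shared size.
        av, _ = self.cross_attn(query=audio, key=video, value=video)  # fused audio-video features
        # Pool over time, then gate the audio-video vector with the text representation.
        text_vec, av_vec = text.mean(dim=1), av.mean(dim=1)
        g = self.gate(torch.cat([text_vec, av_vec], dim=-1))
        fused = text_vec + g * av_vec        # gated residual combination
        return self.classifier(fused)        # logits; softmax is applied at classification time


if __name__ == "__main__":
    model = GatedCrossModalFusion()
    t, a, v = (torch.randn(2, 20, 256) for _ in range(3))
    print(model(t, a, v).shape)  # torch.Size([2, 7])
```

The gate here plays the role the abstract attributes to causal gating, i.e. modulating how much of the fused audio-video signal is allowed to influence the final prediction; the actual causal formulation in the paper may differ.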