The recent generation of Microsoft Kinect cameras captures a series of multimodal signals, providing RGB video, depth sequences, and skeleton information, which makes it possible to enhance human action recognition by fusing different data modalities. However, most existing fusion methods simply concatenate or combine features, ignoring the underlying semantics shared across modalities and thereby limiting recognition accuracy. In addition, the captured signals contain a large amount of background noise. In this work, we propose a Vision Transformer-based Bilinear Pooling and Attention Network (VT-BPAN) fusion mechanism for human action recognition. This work improves recognition accuracy in the following ways: 1) An effective two-stream feature pooling and fusion mechanism is proposed, in which RGB frames and skeleton data are fused to enhance the spatio-temporal feature representation. 2) A spatially lightweight vision Transformer is proposed, which reduces the computational cost. The framework is evaluated on three widely used video action datasets, and the proposed approach achieves performance comparable with state-of-the-art methods.
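To make the two-stream fusion idea concrete, the sketch below shows a generic bilinear pooling of an RGB feature vector and a skeleton feature vector in PyTorch. The feature dimensions, module name, and the signed square-root / L2 normalization steps are illustrative assumptions about a typical bilinear fusion layer, not the authors' actual VT-BPAN implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BilinearFusion(nn.Module):
    """Minimal sketch: bilinear pooling of RGB and skeleton features.

    All dimensions and the projection layer are assumptions for
    illustration; they do not reproduce the paper's architecture.
    """
    def __init__(self, rgb_dim=512, skel_dim=256, out_dim=1024):
        super().__init__()
        self.proj = nn.Linear(rgb_dim * skel_dim, out_dim)

    def forward(self, rgb_feat, skel_feat):
        # Outer product captures pairwise interactions between the two streams.
        bilinear = torch.einsum('bi,bj->bij', rgb_feat, skel_feat)
        bilinear = bilinear.flatten(start_dim=1)
        # Signed square-root and L2 normalization, a common post-processing
        # step after bilinear pooling.
        bilinear = torch.sign(bilinear) * torch.sqrt(torch.abs(bilinear) + 1e-12)
        bilinear = F.normalize(bilinear, dim=1)
        return self.proj(bilinear)

# Usage: a batch of 8 clips with per-stream feature vectors (dims assumed).
rgb = torch.randn(8, 512)
skel = torch.randn(8, 256)
fused = BilinearFusion()(rgb, skel)   # shape: (8, 1024)
```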