Network intrusion detection, which takes the extraction and analysis of network traffic features as the main method, plays a vital role in network security protection. The current network traffic feature extraction and analysis for network intrusion detection mostly uses deep learning algorithms. Currently, deep learning requires a lot of training resources, and have weak processing capabilities for imbalanced data sets. In this paper, a deep learning model (MFVT) based on feature fusion network and Vision Transformer architecture is proposed, to which improves the processing ability of imbalanced data sets and reduces the sample data resources needed for training. Besides, to improve the traditional raw traffic features extraction methods, a new raw traffic features extraction method (CRP) is proposed, the CPR uses PCA algorithm to reduce all the processed digital traffic features to the specified dimension. On the IDS 2017 dataset and the IDS 2012 dataset, the ablation experiments show that the performance of the proposed MFVT model is significantly better than other network intrusion detection models, and the detection accuracy can reach the state-of-the-art level. And, When MFVT model is combined with CRP algorithm, the detection accuracy is further improved to 99.99%.