In recent years, models based on the Swin Transformer architecture have substantially improved performance on many computer vision tasks. However, the effectiveness of the Swin Transformer architecture for gaze estimation has yet to be verified. In this paper, to introduce the Swin Transformer into gaze estimation, we propose the Pure Swin Transformer network (PST-Net) by adapting Swin-T. Building on this, to further improve performance and reduce computational complexity, we propose a new lightweight model, the Hybrid Swin Transformer network (HST-Net). To the best of our knowledge, this is the first application of the Swin Transformer to gaze estimation. HST-Net improves performance by adding CNNs before and after the Swin Transformer block. Experiments show that HST-Net clearly outperforms PST-Net and that its generalization performance exceeds that of previous methods. We further evaluate HST-Net on a range of publicly available datasets and analyze how the depthwise separable convolution placed after the Swin Transformer block affects the final results. The experiments demonstrate that, after pre-training, HST-Net outperforms GazeTR-Hybrid on all benchmarks, achieving the best performance while reducing parameters by 56.4% and FLOPs by 75.4%.
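The hybrid design described above (convolutions wrapped around a transformer stage, with a depthwise separable convolution after it) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the class names are hypothetical, and a generic multi-head self-attention layer stands in for the actual Swin Transformer block with shifted-window attention.

```python
import torch
import torch.nn as nn


class DepthwiseSeparableConv(nn.Module):
    """Depthwise 3x3 conv followed by a pointwise 1x1 conv.

    Uses far fewer parameters and FLOPs than a standard convolution,
    which is how such layers help keep a model lightweight.
    """

    def __init__(self, channels):
        super().__init__()
        self.depthwise = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.pointwise = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))


class HybridBlock(nn.Module):
    """Illustrative hybrid stage: CNN -> attention block -> depthwise separable conv.

    The attention stage below is plain multi-head self-attention over
    flattened spatial tokens, standing in for a real Swin Transformer
    block (which would use windowed, shifted attention).
    """

    def __init__(self, channels, num_heads=4):
        super().__init__()
        # CNN placed *before* the transformer stage to extract local features.
        self.pre_conv = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        # Depthwise separable conv placed *after* the transformer stage.
        self.post_conv = DepthwiseSeparableConv(channels)

    def forward(self, x):
        x = self.pre_conv(x)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)  # (B, H*W, C) token sequence
        q = self.norm(tokens)
        attended, _ = self.attn(q, q, q)
        tokens = tokens + attended  # residual connection around attention
        x = tokens.transpose(1, 2).reshape(b, c, h, w)
        return self.post_conv(x)


# Example: a batch of 2 feature maps, 32 channels, 14x14 spatial size.
feats = torch.randn(2, 32, 14, 14)
out = HybridBlock(32)(feats)
print(out.shape)  # torch.Size([2, 32, 14, 14])
```

The sketch keeps the spatial resolution unchanged, so such a block can be stacked or dropped into an existing backbone between stages.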