Regressing 3D human pose from monocular images remains challenging, particularly for rare poses and under occlusion. To address these problems, we propose SR-ViT, a novel approach to 3D human pose estimation based on Split-and-Recombine and the Visual Transformer. Our method first feeds the 2D joint coordinates from multiple frames into a 3D feature extractor to obtain per-frame 3D features. After these features are fused with position embeddings, a Transformer encoder models the global correlation across all frames, and a regression head produces the final 3D pose. Estimating the 3D pose of the center frame from consecutive frames in this way effectively alleviates the joint occlusion problem. Improvements to the structure of the 3D feature extractor and to the design of the loss function raise prediction performance on rare poses, and the self-attention mechanism is enhanced in both its global and local aspects to further boost model performance. We evaluate our method on two benchmark datasets, Human3.6M and MPI-INF-3DHP, and experimental results show that it outperforms the baseline methods on both.
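To make the described pipeline concrete, the following is a minimal PyTorch sketch of the architecture: per-frame lifting of 2D joints, fusion with a learnable position embedding, a Transformer encoder over the frame sequence, and a regression head for the center frame. It is illustrative only; the class name SRViTSketch, all dimensions, and the plain MLP standing in for the Split-and-Recombine 3D feature extractor are our assumptions, not the authors' implementation.

```python
# A minimal sketch, assuming PyTorch. All names and dimensions are
# hypothetical placeholders; the MLP below stands in for the paper's
# Split-and-Recombine 3D feature extractor.
import torch
import torch.nn as nn

class SRViTSketch(nn.Module):
    def __init__(self, num_joints=17, num_frames=9, dim=256,
                 depth=4, heads=8):
        super().__init__()
        # Per-frame 3D feature extractor: lifts 2D joint coordinates
        # (x, y per joint) to a frame-level feature vector.
        self.extractor = nn.Sequential(
            nn.Linear(num_joints * 2, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )
        # Learnable position embedding, fused (added) to frame features.
        self.pos_embed = nn.Parameter(torch.zeros(1, num_frames, dim))
        # Transformer encoder models global correlation across all frames.
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # Regression head outputs the 3D pose of the center frame.
        self.head = nn.Linear(dim, num_joints * 3)
        self.num_joints = num_joints

    def forward(self, joints_2d):
        # joints_2d: (batch, frames, joints, 2)
        b, f, j, _ = joints_2d.shape
        feats = self.extractor(joints_2d.reshape(b, f, j * 2))
        feats = feats + self.pos_embed  # fuse position information
        feats = self.encoder(feats)     # global inter-frame attention
        center = feats[:, f // 2]       # token of the center frame
        return self.head(center).reshape(b, self.num_joints, 3)

model = SRViTSketch()
pose_3d = model(torch.randn(2, 9, 17, 2))  # -> (2, 17, 3)
```

Taking only the center-frame token after the encoder reflects the stated goal of estimating the center frame's 3D pose from its temporal context; the surrounding frames contribute through self-attention rather than through separate outputs.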