Inspired by the wide application of the Transformer in computer vision and its strong capacity for temporal feature learning, this paper proposes a novel and efficient spatio-temporal residual attention network for student action recognition in classroom teaching videos. The network first fuses 2D spatial convolution and 1D temporal convolution to learn spatio-temporal features, and then incorporates the Reformer to learn deeper, visually salient spatio-temporal characteristics of student classroom actions. Based on this network, a single-person action recognition model for classroom teaching videos is proposed. Because classroom video scenes often contain multiple students, the single-person model is combined with object detection and tracking to associate the temporal and spatial features of the same student target across frames, thereby achieving multi-student action recognition in classroom video scenes. Experimental results on a classroom teaching video dataset and a public video dataset show that the proposed model achieves higher action recognition performance than strong existing models and methods.
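As a rough illustration of the multi-student step, the sketch below groups tracked detections into one time-ordered box sequence per student and passes each sequence to a single-person recognizer. It is a minimal sketch under stated assumptions, not the paper's implementation: the detection tuple layout, the names `build_student_tubes` and `recognize_all`, and the `classify_tube` callable are all hypothetical, and the detector, tracker, and recognition network are assumed to exist elsewhere.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# Hypothetical data layout: a tracker is assumed to have already assigned
# a stable track_id to each detected student in each frame.
Box = Tuple[int, int, int, int]      # (x1, y1, x2, y2) in pixels
Detection = Tuple[int, int, Box]     # (frame_index, track_id, box)
Tube = List[Tuple[int, Box]]         # time-ordered (frame_index, box) pairs


def build_student_tubes(detections: List[Detection]) -> Dict[int, Tube]:
    """Group tracked detections into one time-ordered sequence per student."""
    tubes: Dict[int, Tube] = defaultdict(list)
    for frame_index, track_id, box in detections:
        tubes[track_id].append((frame_index, box))
    for tube in tubes.values():
        tube.sort(key=lambda item: item[0])  # order each sequence by frame index
    return dict(tubes)


def recognize_all(
    detections: List[Detection],
    classify_tube: Callable[[Tube], object],
) -> Dict[int, object]:
    """Apply a single-person recognizer (assumed, passed in) to every student."""
    tubes = build_student_tubes(detections)
    return {track_id: classify_tube(tube) for track_id, tube in tubes.items()}
```

In this sketch `classify_tube` stands in for the single-person action recognition model; in practice it would crop the boxed regions from the video frames before classification.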