Most existing deep-learning-based dynamic sign language recognition methods rely directly on RGB video sequences, or process whole sequences rather than only the frames that capture the change of gesture. As a result, hand gesture features are extracted inaccurately and recognition accuracy for complex gestures suffers. To address these problems, this paper proposes a new dynamic hand gesture recognition method based on key skeleton information, which combines a residual convolutional neural network with a long short-term memory recurrent network and is called the KLSTM-3D residual network (K3D ResNet). In K3D ResNet, the spatiotemporal complexity of network computation is reduced by extracting representative skeleton frames of gesture change. Spatiotemporal features are then extracted from the skeleton keyframe sequence, and after feature analysis an intermediate score is established for each action in the video sequence. Finally, classification of the video sequences accurately identifies the sign language. Experiments were performed on the DHG14/28 and SHREC'17 Track datasets. Validation accuracy on the DEVISIGN-D dataset reached 88.6%, and the accuracy of combining RGB and skeleton information reached 93.2%.
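The keyframe-extraction step described above can be sketched as follows. This is a minimal illustration only: the selection criterion (mean per-joint displacement against a fixed threshold) and all names and parameter values are assumptions for exposition, not the paper's exact rule.

```python
import numpy as np

def select_keyframes(skeleton_seq, motion_threshold=0.05):
    """Keep frames whose joints moved noticeably since the last kept frame.

    skeleton_seq: array of shape (T, J, 3) -- T frames, J joints, xyz coords.
    motion_threshold: mean per-joint displacement required to keep a frame
    (hypothetical value; the paper does not specify its criterion).
    """
    keyframes = [0]  # always keep the first frame as the reference
    last = skeleton_seq[0]
    for t in range(1, len(skeleton_seq)):
        # Mean Euclidean displacement of all joints vs. the last kept frame.
        motion = np.linalg.norm(skeleton_seq[t] - last, axis=-1).mean()
        if motion > motion_threshold:
            keyframes.append(t)
            last = skeleton_seq[t]
    return keyframes
```

Downstream, only the selected keyframes would be fed to the spatiotemporal feature extractor, which is what reduces the computational complexity the abstract mentions.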