Skeleton-based human behavior recognition has been widely studied because it is efficient and robust to complex backgrounds. Although skeleton data accurately captures dynamic changes in human posture, recognition based on it alone depends heavily on the quality of the skeleton data and ignores interaction with the environment. When skeleton keypoints are missing, performance drops significantly. To address these issues, this paper proposes a behavior recognition method that combines contextual semantics with skeleton detection. The method explicitly models the correlations among human skeletons, objects, and human-object interactions. While recognizing behaviors from human skeletons, it simultaneously identifies objects near the skeletons, converts them into semantic information, and performs multimodal fusion. It uses transformer-based semantic similarity calculation to estimate the correlation between behaviors and target objects, and then combines the scores from the two stages to obtain the final prediction. Experimental results show that on the UCF101 dataset, which is closer to real-world scenarios, the proposed method achieves an 8.4% improvement in accuracy over PoseConv3D.
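The two-stage score combination described above can be illustrated with a minimal sketch. The abstract does not specify the fusion rule, so everything concrete here is an assumption made for illustration: the weighted sum, the weight `alpha`, the use of cosine similarity over precomputed embeddings as a stand-in for the transformer-based semantic similarity, and the function names `cosine_similarity_matrix` and `fuse_scores`. In practice the action and object embeddings would come from a transformer text encoder; the toy vectors below are hand-made.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    x = np.asarray(x, dtype=float)
    e = np.exp(x - x.max())
    return e / e.sum()


def cosine_similarity_matrix(a, b):
    """Pairwise cosine similarity between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


def fuse_scores(skeleton_logits, action_emb, object_emb, object_conf, alpha=0.5):
    """Hypothetical late fusion of skeleton scores and object-context scores.

    skeleton_logits: (num_actions,) raw scores from a skeleton-based recognizer
    action_emb:      (num_actions, d) embeddings of the action labels
    object_emb:      (num_objects, d) embeddings of detected object labels
    object_conf:     (num_objects,) detector confidences
    alpha:           assumed weight of the skeleton stage (not from the paper)
    """
    pose_scores = softmax(skeleton_logits)
    # Similarity of every action label to every detected object label.
    sim = cosine_similarity_matrix(action_emb, object_emb)
    # Weight by detector confidence and keep the best-matching object per action.
    context_scores = softmax((sim * object_conf).max(axis=1))
    return alpha * pose_scores + (1.0 - alpha) * context_scores


# Toy example: the skeleton stage is nearly undecided between two actions,
# and one detected object is semantically close to the first action.
actions = ["playing guitar", "riding bike"]
skeleton_logits = np.array([0.1, 0.0])
action_emb = np.array([[1.0, 0.0], [0.0, 1.0]])
object_emb = np.array([[0.9, 0.1]])  # stand-in embedding for a detected "guitar"
object_conf = np.array([0.8])

fused = fuse_scores(skeleton_logits, action_emb, object_emb, object_conf)
predicted = actions[int(np.argmax(fused))]
```

In this sketch the object context raises the score of the action whose label is most similar to the detected object, which is the kind of disambiguation that helps when skeleton keypoints are missing or unreliable. Setting `alpha=1.0` falls back to the skeleton-only prediction.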