The ability to ascribe mental states, such as beliefs or desires, to oneself and to other individuals forms an integral part of everyday social interaction. One task that has been used extensively to test mental state attribution in a variety of clinical populations is the animations task, in which participants are asked to infer mental states from short videos of interacting triangles. In this task, individuals with clinical conditions such as autism spectrum disorders typically offer fewer and less appropriate mental state descriptions than controls; however, little is currently known about why they show these difficulties. Previous studies have hinted that the similarity between an observer’s movements and the triangles’ movements is a key factor in the successful interpretation of these animations. In this study, we present a novel adaptation of the animations task that makes it possible to track and compare the kinematics of animation generators and observers. Using this task and a population-derived stimulus database, we demonstrate that an animation’s kinematics, and the kinematic similarity between observer and generator, are integral to the correct identification of that animation. Our results shed light on why some clinical populations show difficulties in this task, and they highlight the role of participants’ own movements and of specific perceptual properties of the stimuli.