Attributing Minds to Triangles: Kinematics and Observer-Animator Kinematic Similarity predict Mental State Attribution in the Animations Task


The ability to ascribe mental states, such as beliefs or desires, to oneself and other individuals forms an integral part of everyday social interaction. One task that has been used extensively to test mental state attribution in a variety of clinical populations is the animations task, in which participants are asked to infer mental states from short videos of interacting triangles. In this task, individuals with clinical conditions such as autism spectrum disorder typically offer fewer and less appropriate mental state descriptions than controls; however, little is currently known about why they show these difficulties. Previous studies have hinted at the similarity between an observer's and the triangles' movements as a key factor in the successful interpretation of these animations. In this study we present a novel adaptation of the animations task, suitable for tracking and comparing the kinematics of animation generators and observers. Using this task and a population-derived stimulus database, we demonstrate that an animation's kinematics, and the kinematic similarity between observer and generator, are integral to the correct identification of that animation. Our results shed light on why some clinical populations show difficulties in this task and highlight the role of participants' own movement and of specific perceptual properties of the stimuli.

We thank Bodo Winter, Hélio Cuve, Jo Cutler and Olivier Codol for their assistance with statistical questions.

Introduction
Seminal work by Heider and Simmel 1 demonstrated that humans readily attribute mental states to two triangles moving around a rectangular enclosure. Since their inception in 1944, such "animations tasks" (also referred to as Frith-Happé Animations 2 and the Social Attribution Task 3 ) have grown dramatically in popularity and have been used in a wide variety of clinical populations, including autism spectrum disorder (ASD) 2,4 , schizophrenia 5 , antisocial personality disorder 6 , Huntington's disease 7 and Tourette's syndrome 8 . Though animations tasks have been scored and administered in a number of ways (some studies count the number of mental state terms used to describe the movements of the triangles 2,4 ; others have asked participants to rate the type of interaction or the mental state word depicted in the animations 9,10 ), it is generally agreed that "poor performance" indicates a problem with identifying the triangles as mentalistic agents and ascribing appropriate mental states to them.
We refer to these processes here as 'mental state attribution'.
Though mental state attribution has been found to be atypical across a range of clinical populations, little is known about why some individuals struggle to attribute appropriate mental states to the triangles. One explanation is that individuals who struggle with the animations task have a general deficit in the ability to attribute minds and ascribe appropriate mental states, and would therefore also exhibit atypicalities in other tests of mental state attribution. However, animations tasks tend to be more sensitive to mental state attribution difficulties than other tests, as shown by Abell et al. 2 .
A recent study highlights that kinematic similarities between the triangles' movements and the participant's own movements may influence performance on the animations task 9 .
Edey and colleagues asked autistic ('condition-first' terminology is used in line with the majority preference expressed in a survey of the autistic community 11 ) and non-autistic participants to complete the animations task, and also to produce their own animations using triangles that could be moved around an enclosure with magnetic levers. The authors found that animations produced by autistic individuals were more jerky (i.e., exhibited greater changes in acceleration and deceleration) than those produced by non-autistic individuals. Furthermore, whereas non-autistic participants could readily attribute mental states to animations created by other non-autistic participants, they had difficulties attributing mental states to the jerky animations that had been produced by the autistic participants. The authors proposed that movement similarity significantly contributes to performance in the animations task: that is, non-autistic individuals were better able to correctly identify animations created by other non-autistic participants because the movement kinematics in the videos were similar to the kinematics that they themselves would use to move the triangles. Conversely, autistic participants in Edey's study did not show improved performance when rating their own group's animations relative to the control group's. The authors concluded that the increased variability in jerk within this group led to a reduced number of animations sufficiently similar to an observer's own movements to facilitate mentalizing performance in their autistic participants.
The proposal that movement similarity may affect performance in the animations task is bolstered by recent empirical work showing that observers more accurately estimate a human actor's underlying intentions when the kinematics of the actor's movements closely approximate the observer's own movement kinematics 12 . Furthermore, a role for movement similarity in mental state attribution is in line with theoretical accounts suggesting that inferences about others' actions are facilitated by mapping visual representations of others' actions onto our own visual/motoric representations of the same actions [13][14][15][16] . The movement similarity hypothesis would propose that mental state attribution difficulties in classic animations tasks may, at least in part, be explained by differences between the way the triangles are animated and the way an observer would move the triangles if required to create their own animation. This raises the possibility that clinical groups might exhibit accurate mental state attribution for animations where kinematics are matched to a participant's own movement kinematics. To better understand why some individuals struggle to attribute appropriate mental states in the animations task, the first aim of the current study was to test the hypothesis that a significant amount of variance in performance in the animations task would be accounted for by the kinematic jerkiness of the animation and the similarity between the kinematics of the animation and a participant's own movements.
Kinematic jerk and movement similarity are not the only factors which plausibly influence performance on the animations task. Previous studies have highlighted potential roles for stimulus features including the rotation of, and distance between, the triangles 17 , and the shape of the triangles' trajectories 18 . For instance, Roux et al. documented highly distinguishable trajectory paths for random, goal-directed and mental state animations, thus suggesting that trajectory path may be an important cue in mental state attribution.
Correspondingly, the second aim of the current study was to explore the extent to which a range of other stimulus features, including trajectory shape, influence the ease with which participants correctly attribute a mental state to an animation. By doing so, we shed light on a multiplicity of factors which may explain why some clinical groups find the animations task so challenging.
For this latter analysis we made use of the fact that, similar to a sound wave, a triangle's trajectory comprises a complex wave and thus can be decomposed with a Fourier transform and represented as spectral density in different frequency bands 19 . In other words, the Fourier transform can be used to characterize the shape of a trajectory. For example, a trajectory which approximately follows an elliptical orbit oscillates in speed and curvature twice during every full rotation and consequently would be characterized by high spectral density in a band centered around an angular frequency of two. Adapting a method developed by Huh & Sejnowski, we explored whether there are particular angular frequency bands which differentiate mocking, seducing, surprising, following and fighting animations and whether spectral density in these bands was predictive of accuracy.
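The elliptical case above can be checked numerically. The following Python sketch (illustrative only, not part of the study's analysis code) samples an ellipse at a constant angular rate and shows that the spectrum of its speed profile is dominated by an angular frequency of two:

```python
import numpy as np

n = 1024
t = np.linspace(0, 2 * np.pi, n, endpoint=False)  # one full revolution
a, b = 3.0, 1.0                                   # ellipse semi-axes
x, y = a * np.cos(t), b * np.sin(t)

# instantaneous speed along the path (analytic derivative of position)
speed = np.hypot(-a * np.sin(t), b * np.cos(t))

# amplitude spectrum over angular frequency (cycles per revolution)
amp = np.abs(np.fft.rfft(speed - speed.mean())) / n
dominant = int(np.argmax(amp))
print(dominant)  # 2 -> speed oscillates twice per revolution
```

Speed along an ellipse peaks twice per orbit (once at each end of the minor axis), which is why the dominant component sits at angular frequency two rather than one.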
Currently available animation task datasets are not suitable to test our hypotheses for two reasons: First, having been created by experimenters or graphic designers, the stimuli in these tasks typically represent a narrow range of kinematics and thus lack the variation necessary for quantifying the contribution of kinematics and other stimulus features to performance. Second, tasks to date offer no option to track animator (or observer) kinematics at sufficient sampling rates to reliably make inferences about the role of movement similarity.
Here we created a novel animations database (available upon request) by asking 51 members of the general population to animate two triangles to depict mental-state (mocking, seducing, surprising) and non-mental-state (following, fighting) interactions on a 133 Hz touch screen device. Subsequently, an independent sample of 37 members of the general population watched a selection of videos from our new database. To ensure that participants were exposed to a wide range of kinematics, they watched eight exemplars for each word, ranging from slow to fast speed. Participants rated the extent to which each animation depicted the words mocking, seducing, surprising, following and fighting, in addition to creating their own animation for each word (Fig. 1). In a three-step analysis procedure, we first used Bayesian mixed effects models to test our hypotheses that kinematic jerk and the similarity in kinematics between observer and animator are significant predictors of the accuracy of mental state attributions (confirmatory analysis). In a second step, we used the Fast Fourier Transform (FFT) combined with bootstrapped F-tests to investigate whether mocking, seducing, surprising, following and fighting animations could be reliably distinguished according to the profile of spectral density across a range of frequency bands (exploratory analysis 1). Finally, we employed random forest analysis to determine the relative contribution to accuracy of a multiplicity of factors, including speed, acceleration, jerk, the amount of simultaneous movement of both triangles, the relative distance between triangles, the triangles' average rotation and the magnitude of spectral density in the frequency bands identified in the second analysis step (exploratory analysis 2).

Results
Accuracy for each trial was calculated by subtracting the mean rating for all non-target words from the rating for the target word (e.g., the target word was seducing on trials where the participant watched a video wherein the original animator had attempted to depict the triangles seducing each other). Consequently, a high, positive accuracy score for a seducing animation indicates that an observer rated this animation as depicting seducing to a higher extent than mocking, surprising, following or fighting. For a comparison of mean accuracy scores for each word category see Supplementary Materials. For each video that participants observed, and for each animation that they created themselves, mean jerk magnitude (hereafter: jerk) was calculated.

Jerk affects performance differently for mental- and non-mental state animations
In line with our hypothesis, accuracy was associated with mean jerk; furthermore, jerk interacted with mental state: for mental state animations, lower mean jerk was associated with higher accuracy.

[Table: Prior and posterior probabilities of model parameters predicting accuracy. Note. JerkDiff = jerk difference. For all regression coefficients, weakly informative priors were set as following a normal distribution centered at 0 with an SD of 10.]

To unpack this interaction, post hoc models included word (non-mental state: following, fighting; mental state: mocking, seducing, surprising) as a predictor in addition to jerk and jerk difference.
These models revealed that, for non-mental state animations, there was a strong positive effect of jerk for fighting, but not following, animations.

Higher observer-animator similarity in jerk is associated with higher accuracy only in mental-state animations
In line with our hypothesis, accuracy was also associated with jerk difference; furthermore, jerk difference interacted with mental state such that it was a significant predictor for mental, but not non-mental, state videos. That is, for non-mental state animations the mean of all posterior coefficients for jerk difference was centered near zero.

A combination of ten kinematic and spatial variables best predicts accuracy in the animations task
To investigate whether different triangle trajectories can reliably distinguish between the five target words (i.e., mocking, seducing, surprising, following, fighting) we used FFT to decompose the triangles' trajectories and represent them as an amplitude spectral density profile across a range of angular frequencies. To test for differences between the five target words in spectral density across the angular frequency range, bootstrapped F-tests (with 1000 bootstrap resamples) were performed (see Methods: Data Analysis and Processing). This analysis revealed nine significant clusters, defined as clusters of difference that occurred in less than 5% of comparisons with resampled distributions (see Figure 3A).
To examine whether spectral density in these nine frequency clusters was predictive of accuracy, we used the maxima and minima of each significant cluster as bin edges and calculated the angular frequency spectral density (AFSD) as the area under the curve between the bin edges. We then assessed the relative importance, in predicting accuracy, of AFSD in these nine bins alongside mental state and the kinematic and spatial variables (speed, acceleration, jerk, simultaneous movement, relative distance and mean rotation) by means of a random forest model 23 . Out of all 16 variables tested, 10 were confirmed important, two were confirmed unimportant, and four were classed as tentative on the basis that their permutation importance was not significantly different from the maximal importance of a shadow feature (see Fig 4). We subsequently conducted post hoc random forests separately for mental state and non-mental state animations. These post hoc analyses revealed that, in mental state animations, five factors were predictive of accuracy, with jerk and acceleration being the most prominent predictors, followed by speed, which was ranked third (see Supplementary Fig 2). In addition, AFSD in bin 6 and simultaneous movement were classed as important in predicting accuracy.
In non-mental state animations, a total of eight predictors were identified as important variables, with mean rotation being ranked highest by a considerable margin. In addition to mean rotation, a combination of AFSD in bins 1, 6, 7 and 9, and acceleration, jerk and speed were identified as important features of non-mental state animations.

Figure 4
Random forest variable importances
Note. Variable importances of all 16 features entered into the Boruta random forest, displayed as boxplots. Box edges denote the interquartile range (IQR) between first and third quartile; whiskers denote 1.5 * IQR distance from box edges; circles represent outliers outside of 1.5 * IQR above and below box edges. Box color denotes decision: green = confirmed, yellow = tentative, red = rejected; grey = meta-attributes shadowMin, shadowMax and shadowMean (minimum, maximum and mean variable importance attained by a shadow feature).

Discussion
To better understand why some clinical groups find the animations task so challenging, this study evaluated the relative contribution of jerk, jerk similarity and other stimulus characteristics to mental state attribution performance. Our results confirm our hypothesis that kinematic jerk and movement similarity are predictors of the accuracy of mental state attribution. In addition, we highlight that stimulus features including the shape of the triangles' trajectories and the amount of rotation of the triangles can also affect the ease with which participants are able to appropriately label the target states.
In the first part of our three-step analysis, we found that mental state was the primary predictor of animations task performance. Mental state videos were strongly associated with lower accuracy; correspondingly, non-mental state videos were rated more accurately. This first analysis step further revealed that the triangles' mean jerk in an animation plays a substantial role in interpreting that animation. For mental state attributions jerk was negatively predictive of accuracy, whereas for non-mental state animations jerk was positively predictive of accuracy. Post hoc analyses revealed that this latter result was primarily driven by fighting animations, and that the former was most notable with respect to mocking and surprising animations (though caution is advised since credible intervals of coefficient estimates did not exclude zero). In previous work, Edey and colleagues 9 observed that non-autistic participants were more accurate in their mental state attributions for animations generated by non-autistic participants compared to those generated by autistic participants.
They also observed that animations generated by autistic participants were more jerky compared to those generated by controls. However, in Edey et al.'s study there were a number of additional dimensions along which the two groups' animations may have varied, making it impossible to know whether the autistic participants' animations were difficult to interpret because of their jerky kinematics. Our results show that jerk meaningfully contributes to the accuracy of mental state attributions; thus, our data support the conclusion that jerk is highly likely to be one of the driving factors in the group differences observed by Edey et al.
Our results also highlight kinematic similarity as a potential driving factor of the differences observed by Edey et al. 9 . That is, we observed a positive relationship between kinematic similarity and accuracy. Post hoc analyses revealed that evidence of this relationship was particularly compelling in the case of mocking animations: the more closely a mocking animation's mean jerk approximated the participant's own jerk when animating the same word category, the more accurately that animation was rated. We speculate that Edey et al.'s non-autistic participants performed poorly when attributing mental states to animations produced by autistic individuals not only because these animations were jerky, but also because the kinematics of the animations were dissimilar from the way in which the observer would have produced the same animation.
The second aim of the current study was to explore the extent to which a range of other stimulus features, including trajectory shape, influence mental state attribution accuracy. To quantify trajectory shape we used FFT to decompose trajectories into spectral density in angular frequency bins. Animation identity could be differentiated by AFSD in nine bins, and random forest analyses confirmed that four of these bins (bins 1, 6, 8 and 9) were important predictors of accuracy.
For the third step in our three-part analysis, we employed random forests to ascertain the relative contribution to accuracy of a range of stimulus features. The random forest methodology was chosen for its robustness against (multi-)collinearity and suitability for evaluating contributions of a large number of variables with limited data points 26 . Our random forest analysis confirmed ten features as important predictors of accuracy. In order of relative importance these are: mental state, mean rotation, acceleration, jerk, trajectory shape (AFSD in bins 1, 6, 8 and 9), simultaneous movement of the triangles and speed. Post hoc analyses (see Fig 3B) revealed that with respect to mental state attribution specifically, five of these features were of confirmed importance: jerk, acceleration, speed, AFSD in bin 6 and simultaneous movement. One feature was uniquely important for mental state accuracy: the amount of simultaneous movement of the blue and red triangles. By decomposing the animations task into features which predict accuracy, this random forest analysis deepens understanding of individual differences in animations task performance and raises testable empirical hypotheses for further research. For example, our analysis illustrates that simultaneous movement of the triangles is a stimulus feature which predicts mental state attribution accuracy. This observation raises the possibility that poor performance on the animations task in some clinical groups may be related to differences in processing this stimulus feature.
That is, processing the simultaneous movement of the triangles requires

Animotion Online Task
We created a browser-based application that enables us to record and replay triangle animations. Following instructions, participants were presented with the first word and a 30-second-long presentation of the stationary starting frame, allowing them to plan their subsequent animation of that word. Finally, individuals were given 45 seconds to animate the given word.
This process was repeated for a total of five words (mocking, seducing, surprising, following, fighting), and on each trial participants were given the option to discard and repeat their animations if they were unhappy with the result. Only the final animations were analyzed.

Stimulus Selection
Our procedure resulted in a total of 255 animations (51 for each word), recorded at a frame rate of 133 frames per second. Animations were then visually inspected for sufficient length and movement coverage of more than two quadrants of the screen. 53 animations failed these quality control checks. The final stimulus set comprised 202 animations (42 mocking, 38 seducing, 36 surprising, 44 following, 42 fighting). An a priori power analysis of the complete study was not performed due to the lack of applications available to estimate effect sizes for the present analyses (a mixed effects model with more than one fixed effect). Participants received either course credit or money (£8 per hour) for their participation. None of the participants had previously taken part in stimulus development.

Task
The Ratings Collection phase comprised two tasks. First, all participants carried out a production task, where they created one 45-second-long animation for each of the five target words mocking, seducing, surprising, following and fighting, as described above. Second, participants completed a perception task in which they viewed and rated 40 animations: for each word, eight animations were selected, one at random from each of eight percentile bins of the speed frequency distribution for that word (see Figure 5). Thus, for each word, each participant viewed a selection of animations such that they were exposed to the full range of kinematic variation in the population used to create the stimulus pool.
Finally, after watching each animation, participants were asked to rate on a visual analogue scale ranging from one to ten the extent to which they perceived the video to display the target word (e.g., mocking) and each of the four non-target words (e.g., seducing, surprising, following and fighting). The whole process of creating five and viewing and rating 40 45-second animations lasted between 40 and 50 minutes. Task order was fixed (production task then perception task) to allow participants' animations to be unaffected by the animations they would see in the perception task. Due to the upper limit on the WACOM monitor refresh rate, videos were created with a 133 Hz sampling rate and displayed at 60 Hz.

Figure 5
Example of stimulus selection method.
Note. A) Example of the stimulus selection method for the word mocking. The selection method was the same for all five word categories. From each of eight percentile bins of the speed frequency distribution for a word category, one animation was selected at random and replayed to the participant. B) Schematic depiction of 3 successive trials in the perception task: Each animation was followed by a separate screen with five visual analogue sliding scales (one for each of the five word categories), ranging from 1 to 10.

Procedure
Individuals sat in front of the WACOM Cintiq 22 HD touch screen, tilted at 30 degrees, and first completed a practice phase in which they familiarized themselves with moving the triangles around the screen. They were then instructed that they would first create an animation for each of the five words themselves (instructions were the same as in Building the animotions database; see Supplementary Materials) and subsequently would view and rate animations which had been created by other people. Participants then completed the production and perception tasks as described above.

Data Analysis and Processing
All data were processed in MATLAB R2020a 35 and analyzed in R 36 . Code required to reproduce the data analysis and figures for this study will be freely available at https://osf.io/pqn4u/.

Accuracy Ratings
Accuracy for each trial was calculated by subtracting the mean rating for all non-target words from the rating for the target word. Thus, a positive score indicates that the target word was rated higher than all non-target words, with higher accuracy scores reflecting better discrimination between target and non-target words. See Appendix 1 for further analysis of accuracy scores.
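As a worked example of this score, using hypothetical ratings for a single trial:

```python
import numpy as np

# Hypothetical ratings (1-10 scale) for one trial whose target word
# was "seducing"; all values below are illustrative only.
ratings = {"mocking": 2.0, "seducing": 8.0, "surprising": 3.0,
           "following": 1.0, "fighting": 2.0}
target = "seducing"

non_target = [v for k, v in ratings.items() if k != target]
accuracy = ratings[target] - np.mean(non_target)
print(accuracy)  # 8.0 - mean(2, 3, 1, 2) = 8.0 - 2.0 = 6.0
```

A score of zero would indicate that the observer rated the target word no higher, on average, than the four distractor words.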

Spatial and Kinematic Predictors
All variables were calculated from positional data derived from the center points of the blue and red triangles. All steps of data processing mentioned below were performed on both the animations created by participants (= production data) and the animations from the full stimulus set used as perception task stimuli (= perception data).

Stimulus Kinematics
Instantaneous speed, acceleration magnitude and jerk magnitude were obtained by taking the first-, second- and third-order non-null derivatives of the raw positional data, respectively:

v = (dx/dt, dy/dt) [1]
a = dv/dt [2]
j = da/dt [3]

where x and y represent the x- and y-positions of the red and blue triangles in the Cartesian coordinate system, v, a and j denote instantaneous velocity, acceleration and jerk, respectively, and t denotes time.
As the 'diff' function in MATLAB amplifies the signal noise, which compounds for higher derivatives, we employed a smooth differential filter to undertake each step of differentiation (filter adopted from Huh & Sejnowski, 2015). The Euclidean norms of the x and y vectors of velocity, acceleration and jerk were then calculated to give speed, acceleration magnitude and jerk magnitude. That is, speed is calculated as the distance in pixels moved from one frame to the next. Acceleration magnitude comprises the change in speed from one frame to the next, and jerk magnitude comprises the change in acceleration. Mean speed, mean acceleration magnitude and mean jerk magnitude were then calculated by taking the mean across red and blue values, respectively. Lastly, kinematic values were converted from units of pixels/frame to mm/s.
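A minimal Python sketch of this pipeline on a toy circular trajectory. Note the assumptions: np.gradient stands in for the paper's smooth differential filter (whose coefficients are not given here), and the pixel-to-millimetre scale is hypothetical.

```python
import numpy as np

fs = 133.0                                # sampling rate (frames per second)
t = np.arange(0, 2, 1 / fs)
x = 100 * np.cos(2 * np.pi * t)           # toy circular trajectory, in pixels
y = 100 * np.sin(2 * np.pi * t)

# successive derivatives of position (np.gradient is a stand-in for the
# smooth differential filter used in the paper)
vx, vy = np.gradient(x, 1 / fs), np.gradient(y, 1 / fs)
ax, ay = np.gradient(vx, 1 / fs), np.gradient(vy, 1 / fs)
jx, jy = np.gradient(ax, 1 / fs), np.gradient(ay, 1 / fs)

# Euclidean norms give speed, acceleration magnitude and jerk magnitude
speed = np.hypot(vx, vy)                  # px/s
accel_mag = np.hypot(ax, ay)              # px/s^2
jerk_mag = np.hypot(jx, jy)               # px/s^3

# convert pixels to millimetres with an assumed screen scale (hypothetical)
MM_PER_PX = 0.25
print(round(float(np.median(speed) * MM_PER_PX), 1))
```

For this circular trajectory the analytic speed is constant (radius times angular velocity), which makes the numerical derivatives easy to sanity-check.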

Observer-Animator Kinematic Similarity
In order to measure the kinematic similarity between participants' and stimulus kinematics, absolute observer-animator jerk difference was calculated by first subtracting the mean jerk of each video a person rated from their own jerk values when animating the same word, and then taking the absolute magnitude of those values. Lower jerk difference values indicate higher observer-animator kinematic similarity.
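For illustration, with hypothetical jerk values for one observer and three rated videos of the same word:

```python
import numpy as np

# Hypothetical values: observer's own mean jerk when animating "mocking",
# and the mean jerk of three "mocking" videos they rated.
own_jerk = 120.0                          # observer's production
video_jerk = np.array([95.0, 130.0, 210.0])

jerk_diff = np.abs(video_jerk - own_jerk)
print(jerk_diff)  # [25. 10. 90.] -- smaller values = higher similarity
```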

Angular Frequency Spectral Density (AFSD)
For the purpose of quantifying animation trajectories, we adapted a method developed by Huh & Sejnowski 19 . Because the FFT assumes an infinite signal, when analyzing a finite sample such as the log angular speed here, the first and last values of each sample must be continuous to avoid artefacts in the FFT results. We addressed this, and any general drift in the signal (e.g., from participants slowing their movements due to fatigue), by removing a second-order polynomial trend. The area under the amplitude spectral density curve was normalized to allow like-to-like comparison between differing lengths of red and blue triangle movement within and across participants. Across red and blue triangles' trajectories a weighted mean was then taken by multiplying each AFSD value with a factor reflecting the proportion of curved movement available for a triangle before averaging. See Figure 6 for an example of an amplitude spectrum and the related trajectory path.
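The detrending and normalization steps can be sketched as follows, using a synthetic drifting signal (a sketch only; the polynomial order matches the description above, but the signal itself is invented):

```python
import numpy as np

n = 500
s = np.arange(n, dtype=float)
# synthetic signal: a 4-cycle oscillation plus linear and quadratic drift
signal = np.sin(2 * np.pi * 4 * s / n) + 0.002 * s + 1e-5 * s**2

# remove a second-order polynomial trend before the FFT
coeffs = np.polyfit(s, signal, deg=2)
detrended = signal - np.polyval(coeffs, s)

amp = np.abs(np.fft.rfft(detrended))
amp /= amp.sum()          # normalize so spectra of different lengths compare
print(int(np.argmax(amp)))
```

The dominant spectral component (four cycles per window) survives the detrending, while the drift, which would otherwise leak into the low-frequency bins, is removed.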

Further Spatial Variables
A variety of other variables were created to further quantify spatial aspects potentially affecting inferences from the animations. First, simultaneous movement was calculated as the proportion of all frames in which both red and blue triangles' speed was greater than zero (as seen in [4]), reflecting simultaneous movement of both triangles in a video. Furthermore, relative distance (the average distance between red and blue triangles) was quantified by taking the mean across frames of the Euclidean distance between the triangles' x and y coordinates (see [5]). Finally, mean rotation reflects the average rotation of the blue and red triangles around their own axes, measured in degrees and weighted by the proportion of movement present across all frames for each color ( [6]).
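A sketch of the first two variables on toy positional data (Python; the rotation measure and its movement weighting are omitted, and the four-frame trajectories are invented for illustration):

```python
import numpy as np

# Toy positional data (frames x 2) for red and blue triangle centers.
red = np.array([[0., 0.], [1., 0.], [2., 0.], [2., 0.]])
blue = np.array([[0., 3.], [0., 4.], [0., 4.], [0., 5.]])

red_speed = np.linalg.norm(np.diff(red, axis=0), axis=1)
blue_speed = np.linalg.norm(np.diff(blue, axis=0), axis=1)

# proportion of frames in which both triangles move at once
simultaneous = np.mean((red_speed > 0) & (blue_speed > 0))

# mean Euclidean distance between the two triangles across frames
rel_distance = np.mean(np.linalg.norm(red - blue, axis=1))

print(simultaneous, round(float(rel_distance), 3))
```

Here the triangles move at the same time in only one of three frame transitions, so simultaneous movement is one third.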

Data Analysis Overview
This study investigates the role of a large number of different predictor variables in explaining accuracy in the animations task. For two of these variables we present specific hypotheses (jerk, jerk difference); in addition, we wanted to investigate the role of a larger set of variables on an exploratory basis. For this reason, analyses were conducted in two stages: First, in a confirmatory stage, the roles of jerk and jerk difference were examined using Bayesian mixed models. Second, in an exploratory stage, a random forest model was performed, investigating the relative contribution of all predictor variables together.

Data Cleaning and Transformations
For all analyses, variables that were not normally distributed were either log- or square-root transformed to approximate a normal distribution. Outliers, defined as values exceeding three scaled median absolute deviations from the median, were replaced with the respective lower or upper threshold values. Finally, all predictor variables were z-scored.
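A compact Python sketch of this cleaning pipeline on simulated data. The 1.4826 scale factor, which makes the MAD consistent with the standard deviation under normality, is an assumption about how the "scaled" deviations were computed:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.exp(rng.normal(0.0, 1.0, 1000))     # right-skewed (log-normal) variable

x = np.log(x)                              # log-transform toward normality

# clip outliers beyond three scaled median absolute deviations
med = np.median(x)
mad = 1.4826 * np.median(np.abs(x - med))  # scaled MAD (assumed scale factor)
x = np.clip(x, med - 3 * mad, med + 3 * mad)

x = (x - x.mean()) / x.std()               # z-score
print(round(float(x.mean()), 6), round(float(x.std()), 6))  # ~0 and ~1
```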

Confirmatory analysis
A Bayesian linear mixed effects model was fitted in R using the brms package 39 to evaluate the relative contribution of jerk and jerk difference to accuracy as a function of word category, as well as their three-way interaction. A maximal random effects structure 20 was defined, allowing the intercept, the predictors of interest and their interactions to vary by participants (subject ID) and items (animation ID). Jerk and jerk difference were entered as covariates and word category was entered as a dummy-coded factor. We used brms default priors for the intercept and the standard deviation of the likelihood function, as well as weakly informative priors (following a normal distribution centered at 0 with SD = 10) for all regression coefficients. Each model was run for four sampling chains with 5000 iterations each (including 1000 warmup iterations). There was no indication of convergence issues for any of the models (all Rhat values = 1.00, no divergent transitions).

Exploratory analysis I
Bootstrapped F-tests were performed to test for differences, between the five target words, in the presence of angular frequencies at each of the 1001 points on the amplitude spectrum. Bootstrapping amplitude spectrum values 1000 times revealed nine significant clusters, defined as clusters of difference that occurred in less than 5% of comparisons with resampled distributions (see Fig. 3A). The maxima and minima of each significant cluster were then used as bin edges for calculating the amplitude spectral density as the area under the curve within those nine bins, for both red and blue triangles' trajectories in each animation. For example, for a video with high spectral density, relative to all other animations, in a bin lying between the characteristic frequencies of elliptical and triangular shapes, the triangles in this video would be predominantly moving with a speed and acceleration profile that lies between that of elliptical and triangular trajectories.
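The cluster-identification logic can be sketched with synthetic spectra. Note the assumptions: this sketch permutes word labels rather than bootstrapping as described above, uses fewer resamples and frequency points, and the data are simulated; the max-statistic thresholding idea is the same.

```python
import numpy as np

rng = np.random.default_rng(2)
n_words, n_per, n_freq = 5, 20, 50
data = rng.normal(0.0, 1.0, (n_words, n_per, n_freq))
data[0, :, 10] += 2.0                      # one word differs at frequency point 10

def f_stat(groups):
    """One-way ANOVA F statistic at each frequency point."""
    grand = groups.mean(axis=(0, 1))
    means = groups.mean(axis=1)
    ss_between = groups.shape[1] * ((means - grand) ** 2).sum(axis=0)
    ss_within = ((groups - means[:, None, :]) ** 2).sum(axis=(0, 1))
    df_b = groups.shape[0] - 1
    df_w = groups.shape[0] * (groups.shape[1] - 1)
    return (ss_between / df_b) / (ss_within / df_w)

observed = f_stat(data)

# null distribution of the maximal F, built by shuffling word labels
flat = data.reshape(-1, n_freq)
null_max = np.array([f_stat(rng.permutation(flat).reshape(data.shape)).max()
                     for _ in range(200)])
threshold = np.quantile(null_max, 0.95)    # 5% familywise criterion
print(np.flatnonzero(observed > threshold))
```

Taking the maximum F across all frequency points in each resample is what controls the familywise error rate over the whole spectrum, rather than at each point separately.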

Exploratory analysis II
Relative variable importance of 16 variables in predicting accuracy was assessed using random forest models 23 and the feature selection wrapper algorithm Boruta 24 . Random forests are ensembles of decision trees, where each tree is grown from a pre-specified subset of bootstrapped samples and where, for each tree, only a randomly selected subset of variables is considered as splitting variables. Boruta makes use of the ranger package 40 to train a random forest regression model on all variables as well as their permuted copies, so-called "shadow features". First, the normalized permutation importance (scaled by standard error, see 23 ) of all features is assessed. The permutation importance of a given variable is the reduction in prediction accuracy (mean decrease in accuracy, MDA) of the model when this variable is randomly permuted. A variable is then classed as important when the Z-score of its importance measure is significantly higher than the highest importance Z-score achieved by a shadow feature. Overall performance of the model was evaluated by fitting a random forest with the ranger package with 500 trees and 10 random variables per tree.
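The shadow-feature idea can be sketched in Python, with scikit-learn standing in for ranger/Boruta (a single selection pass on simulated data; Boruta itself iterates this procedure and applies a statistical test rather than a simple comparison):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(3)
n = 202                                    # one averaged value per animation
X = rng.normal(size=(n, 4))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(0.0, 0.5, n)  # only features 0, 1 matter

# "shadow features": column-wise permuted copies with no link to the outcome
shadows = rng.permuted(X, axis=0)
X_all = np.hstack([X, shadows])

forest = RandomForestRegressor(n_estimators=500, random_state=0).fit(X_all, y)
imp = permutation_importance(forest, X_all, y, n_repeats=10,
                             random_state=0).importances_mean

# keep only real features whose importance beats the best-performing shadow
best_shadow = imp[4:].max()
confirmed = np.flatnonzero(imp[:4] > best_shadow)
print(confirmed)
```

Because the shadows preserve each variable's marginal distribution while destroying its relation to the outcome, they provide a data-driven baseline for what "unimportant" looks like.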
Due to the known correlational structure within the data and the current lack of random forest models that can account for random effects, this analysis was performed on an item basis. For this purpose, for every variable, values corresponding to the same item were averaged across subjects, resulting in a total of 202 data points. Note that, due to their reliance on between-subject variance, variables relating to observer-animator kinematic difference were excluded from this analysis.