This study was preregistered on OSF: https://osf.io/wtm3q. We added the interaction hypothesis (H3) to the preregistered hypotheses. Concerning the analysis, we applied a Bayesian framework to allow for interpretation of the evidence for and against a hypothesis. Stimulus evaluation, preprocessing and analysis were performed with R 4.2.2 [41] in RStudio 2022.12.0 [42] and Python 3.9.13 in Spyder 5.3.3 [43]. All code used in the analysis of the stimuli, the pilot and the experimental data is available on OSF (https://osf.io/whgx6/).
Participants
We aimed for a sample size of 195 non-autistic participants. The sample size was determined with a simulation-based power analysis as implemented in the R package mixedpower [44], using pilot data from 37 participants. We set a threshold of 2 and ran 1,000 simulations to achieve 90% power for detecting an effect of interpersonal synchrony of motion energy on impression formation. Inclusion criteria were age between 18 and 60 years, no psychiatric or neurological diagnoses, normal or corrected-to-normal vision and informed consent. Exclusion criteria were high scores on the short form of the German translation of the Autism-Spectrum Quotient questionnaire (above 5 of 10 points; AQ-10 [45]) and low scores on a verbal intelligence test, the Wortschatz-Test (below 6 of 42 points; WST [46]). Additionally, we asked participants in a post-experimental debriefing questionnaire whether they had answered the questions conscientiously. They could choose either “Yes - my answers can be used for research without any problems” or “No - my answers should rather not be used”. This gave participants who were distracted or had responded randomly the opportunity to indicate that their data should not be used; such datasets were excluded from the analysis. We preprocessed the collected data continuously, applying all inclusion and exclusion criteria. In total, we collected data from 247 participants, of which one dataset was excluded because the participant advised against its use and one because of a low WST score. Additionally, 49 participants scored above five on the AQ-10 (age: 18 to 59 years, mean = 25.00 ± 4.59; 35 female). Therefore, 196 participants (age: 18 to 59 years, mean = 26.56 ± 6.97; 142 female) were included to test H1, H2 and H3. For H4, we had to exclude additional participants due to insufficient gaze data quality, resulting in a sample of 91 participants (age: 19 to 59 years, mean = 25.62 ± 5.93; 61 female). For more details on the sample, refer to the sample description in the supplementary materials. Participants were compensated with 10€ or course credit. This study was conducted in accordance with the Declaration of Helsinki and approved by the Ethics Committee of the Medical Faculty, LMU Munich.
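For illustration, a minimal sketch of the simulation-based power analysis described at the beginning of this paragraph might look as follows, assuming a pilot model fitted with lme4; the variable names (impression, sync, participant) and the candidate sample sizes are placeholders, not the exact specification used in this study.

```r
library(lme4)
library(mixedpower)

# Mixed model fitted to the pilot data (n = 37); variable names are placeholders
pilot_model <- lmer(impression ~ sync + (1 | participant), data = pilot_data)

# Simulation-based power across candidate sample sizes; critical_value = 2
# treats |t| > 2 as a detected effect, with 1,000 simulations per step
power <- mixedpower(model = pilot_model, data = pilot_data,
                    fixed_effects = "sync",
                    simvar = "participant",
                    steps = c(150, 175, 195),
                    critical_value = 2,
                    n_sim = 1000)
power  # inspect which sample size reaches ~90% power for the synchrony effect
```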
Experimental procedure
Data collection was conducted in German using the Gorilla Experiment Builder (www.gorilla.sc), an online platform for creating experiments [47]. The experiment consisted of study information, a consent page, demographic questions, the AQ-10, the WST, videos with their associated ratings (impression formation task) and a post-experiment debriefing questionnaire. In the impression formation task, participants saw 44 ten-second-long videos of two people interacting with each other. The videos did not contain any audio, and only outlines of the interactants were shown (Figure 2). The outlines on half of the screen were coloured in green to indicate the target interactant, on whom participants were to base their ratings. Each video was followed by six ratings on a scale from 0 (not at all) to 100 (very much): intelligent, awkward, likeable, trustworthy, Would you start a conversation with the green person? (abbr.: conversation) and Do you think the green person has many friends? (abbr.: friends; for the original German versions see Figure 2). All but the last question were taken from Sasson et al. [12,15]. Participants had to respond to each rating to continue to the next trial. During the presentation of the videos, webcam-based eye tracking was used to measure gaze patterns. First, a calibration was performed. Gorilla uses support vector machines to track the participant’s face and, specifically, their eyes at a frequency of 60 Hz. For each ten-second-long video, this should yield 600 samples under optimal conditions. Then, two metrics were calculated: the proportion of fixations on each screen half and the number of switches between the two halves.
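A hedged sketch of how these two gaze metrics could be derived from the 60 Hz samples is shown below; the data frame gaze_samples and its columns (participant, trial, x as a normalised horizontal coordinate) are assumptions for illustration rather than Gorilla’s actual output format, and the metrics are computed here over raw samples rather than detected fixations for simplicity.

```r
library(dplyr)

# One row per eye-tracking sample; x is assumed to be a normalised horizontal
# coordinate (0 = left edge of the screen, 1 = right edge)
gaze_metrics <- gaze_samples %>%
  group_by(participant, trial) %>%
  mutate(half = if_else(x < 0.5, "left", "right")) %>%
  summarise(
    prop_left  = mean(half == "left"),   # proportion of samples on the left half
    prop_right = mean(half == "right"),  # proportion of samples on the right half
    switches   = sum(half != lag(half), na.rm = TRUE),  # changes between halves
    n_samples  = n(),
    .groups = "drop"
  ) %>%
  mutate(switches_per_100 = 100 * switches / n_samples)
```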
Stimulus creation
Stimuli were created from videos collected in previous studies investigating the influence of diagnostic status on interpersonal synchrony of motion energy in dyadic interactions. Specifically, 24 videos were excerpts from Georgescu et al. [36], and 20 videos were excerpts from Koehler et al. [37]. Half of the videos showed two non-autistic interactants, one of whom was highlighted in green as the target to be judged. The other half of the videos presented mixed interactions between one autistic and one non-autistic interactant. In these videos, it was always the autistic interactant who was highlighted in green to be rated. Therefore, the design was balanced with regard to the independent variable diagnostic status (autistic, non-autistic).
Videos were muted and visually reduced to the outlines of the interactants in MATLAB R2020b (Natick, Massachusetts: The MathWorks Inc; Figure 2). First, a Gaussian filter with σ = 2.0, as implemented in the imgaussfilt function, was applied frame-wise to the greyscale versions of the videos to remove details that could be used to identify individuals. In two videos showing the same dyad, a Gaussian filter with σ = 2.5 was used, which made the outlines slightly rougher but hardly distinguishable from σ = 2.0 for the observer. Then, the edge function was used to detect edges in the filtered frames, so that only a rough outline of the individuals featured in the original video remained in the stimulus material. Last, one half of each frame was coloured green before the edited videos were exported and shown in the experiment. The green colouring indicated the target interactant who was to be rated.
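The stimulus reduction itself was implemented in MATLAB (imgaussfilt followed by edge). Purely as an illustration of the same two steps, an analogous sketch in R using the imager package for a single frame could look as follows; the file name, the use of Canny edge detection in place of MATLAB’s default edge detector, and the plotting call are assumptions made for the sake of a runnable example.

```r
library(imager)

# Load a single greyscale frame (file name is a placeholder)
frame <- grayscale(load.image("frame_001.png"))

# Gaussian blur with sigma = 2 to remove identifying detail
blurred <- isoblur(frame, sigma = 2)

# Edge detection so that only rough outlines of the interactants remain
outline <- cannyEdges(blurred)

plot(outline)
```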
Second, all videos were analysed using Motion Energy Analysis (MEA) [33]. We focused on head motion because the videos from Koehler et al. [37] included one person holding a clipboard, thereby artificially restricting motion in the upper body and arms. The resulting MEA values were further preprocessed in rMEA [48]. From each of the five-minute-long videos, two ten-second-long excerpts were chosen: one from 90 to 100 seconds, representing the introductory phase, and the other from 240 to 250 seconds, representing the body of the conversation. Outliers in the head region were removed using the rMEA::MEAoutlier function, which defines outliers as values exceeding ten times the standard deviation. Then, the values were scaled using the rMEA::MEAscale function. Lastly, we computed lagged cross-correlations of the interactants’ MEA values within the same dyad using the rMEA::MEAccf function. We used pseudosynchrony to ensure that this measure captures interpersonal synchrony [49]. We evaluated 1) whether there is time dependency and 2) whether synchrony exists in 10-second sections of the full videos, as proposed by Moulder et al. [32]. Bayesian one-sample t-tests, as implemented in the BayesFactor::ttestBF function, revealed extreme evidence in favour of 1) and strong evidence in favour of 2), each based on 1,000 iterations of pseudosynchrony per dyad (for details see the supplementary materials).
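A hedged sketch of the rMEA preprocessing steps named above is given below; the file layout, sampling rate, and the lag and window settings of MEAccf are illustrative placeholders, as they are not specified in this section.

```r
library(rMEA)

# Read frame-wise head-motion energy for both interactants of each dyad
# (column indices, sampling rate and naming scheme are placeholders)
mea <- readMEA("mea_files/", sampRate = 25, s1Col = 1, s2Col = 2,
               s1Name = "Green", s2Name = "White", skip = 1,
               idOrder = c("id", "session"), idSep = "_")

# Remove outliers (values exceeding 10 SD) and scale the time series
mea <- MEAoutlier(mea, threshold = function(x) sd(x) * 10,
                  direction = "greater", replace = NA)
mea <- MEAscale(mea)

# Windowed lagged cross-correlations between the two interactants
ccf_real <- MEAccf(mea, lagSec = 5, winSec = 10, incSec = 5,
                   r2Z = TRUE, ABS = TRUE)

# Pseudosynchrony baseline: shuffle interactants across dyads and recompute
pseudo <- shuffle(mea, size = 50)
ccf_pseudo <- MEAccf(pseudo, lagSec = 5, winSec = 10, incSec = 5,
                     r2Z = TRUE, ABS = TRUE)
```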
This investigation showed that lagged cross-correlation was an effective measure of interpersonal synchrony of motion energy in this sample. Additionally, lagged cross-correlations not only provide an estimate of total interpersonal synchrony but also allow the contributions of the two interactants to be separated, by considering either lags where the target interactant leads their interaction partner (Green leading) or vice versa (White leading). This approach has been used successfully to investigate relationship quality in the therapeutic context [34]. Since our participants were asked to rate only one interactant in each video, we used these estimates of each interactant leading their counterpart as predictors in our analysis, rather than a single interpersonal synchrony score describing the coordination of both interactants with each other. We log-transformed both leading scores to approximate a normal distribution, which allowed us to scale all our predictors.
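An illustrative sketch of how the two leading scores could be extracted from the windowed cross-correlations and prepared as predictors is shown below; the object ccf_list (one windows-by-lags matrix per video) and the sign convention for the lags are assumptions, not the exact extraction used here.

```r
# ccf_list: hypothetical list with one windows-by-lags matrix per video;
# negative lag columns are taken to mean the green (target) interactant leads
leading <- do.call(rbind, lapply(ccf_list, function(m) {
  lags <- as.numeric(colnames(m))
  data.frame(green_leading = mean(m[, lags < 0], na.rm = TRUE),
             white_leading = mean(m[, lags > 0], na.rm = TRUE))
}))

# Log-transform to approximate normality, then scale so that all predictors
# are comparable in the mixed models
leading$green_leading_log <- as.numeric(scale(log(leading$green_leading)))
leading$white_leading_log <- as.numeric(scale(log(leading$white_leading)))
```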
Concerning the gaze patterns, we first excluded all trials in which the face of the participant was tracked with 50% accuracy or less (value recommended by Gorilla Support, 2022). Additionally, we set the minimum threshold for fixation duration to 50 ms. Trials with fewer than 400 samples were excluded from the analysis. One hundred and five participants with less than 50% of their trials remaining were excluded from the analysis of the gaze patterns, resulting in a sample of 91 participants.
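A hedged sketch of these exclusion steps, assuming hypothetical trial- and fixation-level data frames (gaze_trials, fixations) with placeholder column names:

```r
library(dplyr)

# Trial-level exclusions: face tracked with more than 50% accuracy and at
# least 400 of the 600 possible samples per ten-second video
trials_clean <- gaze_trials %>%
  filter(face_tracking > 0.50,
         n_samples >= 400)

# Discard fixations shorter than 50 ms before computing gaze metrics
fixations_clean <- fixations %>%
  filter(duration_ms >= 50)

# Keep only participants with at least 50% of their 44 trials remaining
included <- trials_clean %>%
  count(participant, name = "n_trials") %>%
  filter(n_trials >= 0.5 * 44) %>%
  pull(participant)
```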
Analysis
We used a combination of Bayesian linear mixed models, as implemented in the brms package [51], and Bayesian t-tests, as implemented in the BayesFactor package [52]. For all random effects, we followed the guidelines by Barr et al. [53,54]. The Bayesian linear mixed models were run with four Markov chains of 10,000 iterations each (50% warm-up). To draw conclusions on the significance of estimated parameters and differences, we used the brms::hypothesis function and Jeffreys’ scheme to interpret Bayes factors [55]. The brms::hypothesis function computes an evidence ratio and a posterior probability for the hypothesis against its alternative. We chose one-sided testing in the direction of the respective estimate and adjusted α to 2.5%, since we preregistered non-directional testing with α = 0.05. Therefore, all parameter estimates with a posterior probability above 97.5% for the non-directional hypotheses (H1, H2, H3) and above 95% for the directional hypothesis (H4) were considered significant.
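A minimal sketch of the sampler settings and the hypothesis-testing workflow follows; the formula and parameter name are placeholders (the full model is sketched after the next paragraph), and priors are left at brms defaults as they are not specified in this section.

```r
library(brms)

# Placeholder model illustrating the sampler settings (4 chains, 10,000
# iterations each, 50% warm-up)
fit <- brm(impression ~ predictor + (1 | participant), data = dat,
           chains = 4, iter = 10000, warmup = 5000)

# One-sided test in the direction of the estimate; alpha = 0.025 mirrors the
# adjustment of the preregistered two-sided alpha = 0.05
h <- hypothesis(fit, "predictor > 0", alpha = 0.025)
h$hypothesis$Evid.Ratio  # evidence ratio, interpreted with Jeffreys' scheme
h$hypothesis$Post.Prob   # posterior probability for the tested hypothesis
```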
For H1, H2 and H3, we averaged the six ratings to create a composite impression score. Before computing the average, we reversed the awkwardness rating, since here higher values correspond to a more negative impression. The unidim function of the psych package revealed a high factor fit of fa.fit = 0.98, suggesting that all ratings measure the same underlying concept. We entered the impression score into a Bayesian linear mixed model with three fixed effects of interest: diagnostic status (autistic, non-autistic), leading of the green target interactant (Green leading) and leading of the white non-target interactant (White leading). We also added three regressors of no interest: the overall motion estimates of both interactants and the source of the videos [36,37]. All parametric predictors were scaled to allow for comparison of the estimates. Lastly, we added random intercepts for stimuli and participants, as well as by-participant random slopes for diagnostic status and video source. We used treatment contrast coding [56], with non-autistic interactants and videos from Georgescu et al. [36] as the reference levels in relation to which all effects are evaluated.
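A hedged sketch of this impression model is given below; variable names are placeholders, parametric predictors are assumed to be scaled already, priors are brms defaults, and the interaction term required for H3 is not spelled out in this section and is therefore omitted from the sketch.

```r
library(brms)

# Treatment coding (R's default for unordered factors) with non-autistic
# interactants and Georgescu et al. videos as reference levels
dat$diagnosis    <- relevel(factor(dat$diagnosis), ref = "non-autistic")
dat$video_source <- relevel(factor(dat$video_source), ref = "Georgescu")

fit_h123 <- brm(
  impression ~ diagnosis + green_leading_log + white_leading_log +  # effects of interest
    motion_green + motion_white + video_source +                    # regressors of no interest
    (1 + diagnosis + video_source | participant) +                  # by-participant slopes
    (1 | stimulus),                                                 # by-stimulus intercepts
  data = dat,
  chains = 4, iter = 10000, warmup = 5000
)
```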
Concerning the gaze patterns, we first used Bayesian paired t-tests to check whether highlighting the target interactant in green led to increased fixation times on that half of the video. Then, we tested H4 by comparing fixation durations on the target’s half of the video, as well as the number of switches per 100 samples, between videos showing autistic and non-autistic targets of the impression judgments. For each outcome variable, the interquartile range method was used to detect and remove outliers.
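A hedged sketch of these comparisons follows; the per-participant vectors (fix_green, fix_white, fix_autistic, fix_nonautistic) are placeholders, and the 1.5 × IQR rule is the conventional choice, as only the interquartile method is stated in the text.

```r
library(BayesFactor)

# Conventional interquartile-range rule for flagging outliers
iqr_keep <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  x >= q[1] - 1.5 * IQR(x, na.rm = TRUE) &
    x <= q[2] + 1.5 * IQR(x, na.rm = TRUE)
}

# Manipulation check: did the green half attract longer fixations?
ttestBF(x = fix_green, y = fix_white, paired = TRUE)

# H4: fixation durations on autistic vs non-autistic targets,
# after removing outliers in each outcome variable
keep <- iqr_keep(fix_autistic) & iqr_keep(fix_nonautistic)
ttestBF(x = fix_autistic[keep], y = fix_nonautistic[keep], paired = TRUE)
```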
For our exploratory analyses, we performed Bayesian linear mixed models similar to those described for testing H1, H2 and H3. We added binarised AQ-10 scores as a categorical predictor (0 = low, 1 = high) to the original model predicting the impression score to explore differences between participants with high and low autism-like traits. Additionally, we ran six Bayesian linear mixed models of the same structure, one for each of the six ratings, to determine which nuanced aspects of impression formation are influenced by diagnostic status and by the interactants leading one another.
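A brief sketch of the exploratory extension, reusing the model sketched above; aq_group is a placeholder name for the binarised AQ-10 predictor.

```r
# Add the binarised AQ-10 score (0 = low, 1 = high) to the impression model;
# update() refits the brms model with the extended formula
fit_aq <- update(fit_h123, formula. = ~ . + aq_group, newdata = dat)
```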