Participants
Adults
Thirty-eight adult participants with autism (21 male; mean age = 27.65 years) and 58 neurotypical (NT) adults (36 male; mean age = 32.69 years) were included in the study. All adult participants were between the ages of 18 and 59 years and achieved full-scale IQ scores of 70 or above as measured by the Wechsler Abbreviated Scale of Intelligence-Second Edition57 (WASI-II). Autism diagnoses were confirmed by the clinical judgment of a licensed psychologist specializing in the assessment of autism, supported by research-reliable administration of the Autism Diagnostic Observation Schedule-258 (ADOS-2).
Exclusion criteria for both groups included the presence of other neurological and genetic disorders, non-autism-related sensory impairments (e.g., uncorrected visual or hearing impairments), and substance/alcohol abuse or dependence during the past two years. Further, individuals in the NT group were excluded if they reported a previous psychiatric history, cognitive or sensory impairment, use of psychotropic medications, or clinically elevated scores on the Social Communication Questionnaire59. Individuals with autism and co-occurring ADHD, anxiety, or depression were included, whereas those with other psychiatric diagnoses within the past five years or co-occurring neurogenetic syndromes were excluded. Ten autistic adults (26%) reported taking antidepressant medications.
Children/Adolescents
Forty-five autistic children/adolescents (35 male; mean age = 11.53 years) and 43 neurotypical (NT) children/adolescents (34 male; mean age = 11.86 years) were included in the study. All child/adolescent participants were between the ages of 8 and 17 years and achieved full-scale IQ (FSIQ) scores of 70 or above as measured by the WASI-II. Autism diagnoses were confirmed by the clinical judgment of a licensed psychologist specializing in the assessment of autism, supported by research-reliable administration of the ADOS-2 and, when available, parent interviews that included algorithm items from the Autism Diagnostic Interview-Revised60 (ADI-R).
Exclusion criteria for children/adolescents were similar to those for adults, with some additional considerations. Mainly, behavior and co-occurring psychiatric conditions in children and adolescents were screened using parent and guardian reports. As in the adult sample, ADHD, anxiety, and depression were not exclusionary for children with autism. Parents of ten autistic participants (22%) reported that their child was taking antidepressant medication.
Ethical Considerations
The study was conducted in accordance with the Declaration of Helsinki, and all participants were compensated $20 per hour for their time following each session. Written informed consent (for adults) or assent (for minors) was provided by all participants, and informed consent was obtained from the parents or guardians of minors. All methods and procedures were approved by the Institutional Review Board for human subjects at Vanderbilt University Medical Center and carried out in accordance with relevant guidelines and regulations on ethical human research.
Measures
The Social Responsiveness Scale–Second Edition61 (SRS-2) was used to measure autistic traits dimensionally across the full sample. Adult participants in both diagnostic groups completed the SRS-2 Adult self-report form, whereas parents or guardians of children/adolescents in both groups completed the analogous caregiver-report questionnaire, the SRS-2 School-Age form. To facilitate comparison across groups, SRS-2 total scores were converted to T-scores (M = 50, SD = 10).
Empathy
Empathy was assessed multidimensionally using an adapted version of the Multifaceted Empathy Test, the MET-J28, a validated performance-based test that separates cognitive and emotional empathy based on responses to emotional faces presented with context in the background. The original MET includes 50 still images depicting emotionally charged facial expressions of 25 positive (e.g., joy, happiness) and 25 negative (e.g., sadness, anger) emotions. The adapted MET-J version used in the present study included only 16 images each of positive and negative valence. The photographs are taken from the International Affective Picture System62 (IAPS), a well-validated database of photographs designed for standardized emotion and attention testing. On each trial, participants viewed an emotional image and were first asked to rate their level of arousal, followed by explicit emotional empathy ratings and a cognitive empathy (i.e., emotion recognition) multiple-choice question. Figure 3 depicts an example trial on the task, recreated using a free-use stock image from the Canva.com image database.
As described by Dziobek et al. (2008), to minimize demands of self-reflection and thereby also mitigate social desirability bias, we included an implicit assessment of emotional empathy by asking participants to rate how calm/aroused the emotional stimuli made them feel using the Self-Assessment Manikin (SAM). The SAM is a visual-analogue scale providing scores ranging from 1 (very calm) to 9 (very aroused). Thus, for each picture, participants were asked (1) "How excited does this picture make you?" (implicit emotional empathy; subsequently described as arousal empathy); (2) "While looking at the picture, how much do your feelings match the X's feelings?" (emotional empathy; EE), measured on a visual Likert scale (1–9); and finally (3) "How does this X feel?" (cognitive empathy; CE). Here, "X" represents the noun used to describe the individual (boy/girl/man/woman) in the image, which varied across trials. Each trial ended with a final presentation of the emotional stimulus that provided feedback for the cognitive empathy question by displaying the correct emotion label from among the four choices. Note that this order and the wording of the EE items differ slightly from the original MET and MET-J, which provided feedback on CE items before presenting the explicit emotional empathy items. We adapted this order to ensure that EE and arousal responses were made as reflexively as possible to the perceived emotion upon initial presentation, rather than being adjusted based on CE feedback. All stimuli were presented as slides of variable duration (ad libitum) in random order on a black screen.
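As an illustration only, the following console-based sketch shows the adapted trial ordering (arousal rating, then EE rating, then the CE forced choice with feedback); the function, stimulus fields, and file names are hypothetical and do not reflect the study's actual task software.

```r
# Hypothetical sketch of one MET-J trial's ordering; not the study's task code.
run_met_trial <- function(stim) {
  cat("[picture displayed:", stim$picture, "]\n")                         # ad libitum viewing
  arousal <- as.integer(readline("How excited does this picture make you? (1 = very calm, 9 = very aroused): "))
  ee <- as.integer(readline(sprintf(
    "While looking at the picture, how much do your feelings match the %s's feelings? (1-9): ", stim$noun)))
  cat(sprintf("How does this %s feel?\n", stim$noun))
  cat(paste(seq_along(stim$choices), stim$choices, sep = ". "), sep = "\n")
  ce <- stim$choices[as.integer(readline("Choose 1-4: "))]
  cat("[picture displayed again; correct answer:", stim$correct, "]\n")   # CE feedback
  data.frame(arousal = arousal, ee = ee, ce_correct = ce == stim$correct)
}

# Example stimulus (hypothetical file name and emotion labels)
run_met_trial(list(picture = "iaps_positive_01.jpg", noun = "woman",
                   choices = c("joyful", "sad", "angry", "afraid"), correct = "joyful"))
```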
Statistical Analyses
Differences in demographics (e.g., age, sex, VIQ, PIQ) and SRS-2 scores were compared between the autism and NT groups within a Bayesian framework. When the outcome of interest was categorical (e.g., correct or incorrect emotion recognition), group differences were examined using a Bayesian analogue of the Pearson chi-squared test63,64. When the outcome of interest was continuous (e.g., age), we examined mean differences using a Bayesian analogue of the Welch (unequal-variances) t-test65. Effect sizes from each of these tests (i.e., Cohen's d and the odds ratio [OR]) were summarized as the posterior median and 95% highest-density credible interval (CrI). Additionally, for all group comparisons, evidence for or against the point null hypothesis (H0; i.e., no difference between groups) was quantified with a Bayes factor (BF10)64,66, defined as the ratio of how likely the data are under the alternative hypothesis (H1; i.e., the difference between groups is nonzero) to how likely the data are under H0. In accordance with widely used guidelines on Bayes factor interpretation67,68, we considered BF10 values > 3 as indicating substantial evidence for H1, BF10 values < 0.333 as indicating substantial evidence for H0, and BF10 values between 0.333 and 3 as providing inconclusive, only "anecdotal" evidence for either hypothesis. All group comparisons were performed in the R statistical computing platform using open-source R code written by author ZJW69. Additional details on the models underlying the Bayesian t-test and chi-squared test analogues are presented in the Supplemental Methods.
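The group comparisons relied on code written by author ZJW69; as an illustration only (not the authors' code), a Bayesian unequal-variances t-test analogue of this kind could be specified in brms with a Savage-Dickey Bayes factor for the point null, as in the sketch below. The data frame and variable names (dat, age, group, groupAutism) are hypothetical.

```r
# Illustrative Bayesian analogue of a Welch t-test: a distributional model with
# group-specific residual SDs, plus a Savage-Dickey Bayes factor for the point null.
library(brms)

dat$age_z <- scale(dat$age)[, 1]            # standardize so the group slope is ~ Cohen's d

fit <- brm(
  bf(age_z ~ group, sigma ~ group),         # separate residual SDs per group
  data = dat,
  prior = prior(normal(0, 1), class = "b"),
  sample_prior = "yes",                     # needed for Savage-Dickey density ratios
  chains = 4, iter = 4000, seed = 1
)

summary(fit)                                # posterior summaries and 95% CrI for the group effect
h <- hypothesis(fit, "groupAutism = 0")     # Evid.Ratio is BF01 for the point null
1 / h$hypothesis$Evid.Ratio                 # invert to obtain BF10
```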
To determine the effects of various predictor variables on arousal, emotional, and cognitive empathy while controlling for possible covariates, we used R70 to analyze the data at the single-trial level using hierarchical Bayesian modeling. Trial-level MET data for arousal, emotional, and cognitive empathy were analyzed using (generalized) linear mixed-effects models ([G]LMEMs), which allowed us to model the correlations between responses derived from the same participants as well as the same stimuli71. LMEMs were used to model arousal and emotional empathy, as the 9-point scale used to derive these outcomes had enough points to be approximated as a continuous variable72. However, we used a logistic GLMEM to model cognitive empathy, as individual trial data from this part of the task consisted of binary "correct/incorrect" responses. The baseline [G]LMEM for each MET-derived outcome included fixed effects of age group (child vs. adult), sex, autism diagnosis, and emotional valence (positive vs. negative), as well as random intercepts for participant and stimulus (see the baseline model for CE in Equation 1 below). Random slopes were also included in this baseline model for all subject-level predictors, allowing the effects of age group, sex, and autism status to vary by stimulus. The decision to treat age as categorical in the Bayesian model averaging (BMA) analyses described below was driven by the finding that performance on the CE task increased with age throughout childhood, reaching an asymptote at approximately age 18–20, thereby indicating a difference between children and adults rather than a true linear age trend.
logit(Pr(CEij = 1)) = (β0 + u0j + w0i) + (β1 + w1i)·AgeGroupj + (β2 + w2i)·Sexj + (β3 + w3i)·Diagnosisj + β4·Valencei (Equation 1), where j indexes participants and i indexes stimuli; u0j denotes the by-participant random intercept, w0i the by-stimulus random intercept, and w1i–w3i the by-stimulus random slopes for the subject-level predictors.
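For concreteness, the baseline CE model above can be expressed in brms formula syntax as in the following sketch; the column names (ce_correct, age_group, sex, diagnosis, valence, participant, stimulus) are hypothetical stand-ins for the actual dataset variables.

```r
# Sketch of the baseline cognitive-empathy model (Equation 1) in brms syntax;
# variable names are hypothetical.
library(brms)

ce_formula <- bf(
  ce_correct ~ age_group + sex + diagnosis + valence +
    (1 | participant) +                               # by-participant random intercept
    (1 + age_group + sex + diagnosis | stimulus),     # by-stimulus intercept and slopes
  family = bernoulli()                                # logistic GLMEM for correct/incorrect
)
```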
For each of the three outcomes, we additionally determined whether several other predictors beyond the baseline model contributed to task performance, including the two-way and three-way interactions between age, diagnosis, and valence; verbal IQ (VIQ); performance IQ (PIQ); and overall level of autistic traits (SRS-2 T-score). To determine whether any given predictor should be added to the baseline model, we fit candidate models that included all combinations of potential predictors (n = 40 potential models including the baseline). Then, using bridge sampling73, we calculated the marginal likelihood of each candidate model, deriving posterior model probabilities in a manner equivalent to the process of Bayesian model averaging74. The model with the highest posterior probability was considered the final model for each outcome. Using these model weights, we also computed inclusion Bayes factors74 (BFinc), allowing us to determine the degree of evidence for or against the inclusion of each predictor in the model. Inclusion Bayes factors are interpretable in the same manner as BF10, with H0 corresponding to exclusion of the variable from the model and H1 to inclusion of the variable in the model.
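As an illustration of this model-comparison step (not the exact analysis code), the bridgesampling R package can compute marginal likelihoods and posterior model probabilities for fitted brms models, provided they were fit with save_pars(all = TRUE). The objects fit_base and fit_viq below are hypothetical fitted models (e.g., the baseline model and the baseline plus VIQ), and the two-model set is a simplification of the full candidate set.

```r
# Hypothetical two-model illustration of bridge sampling and an inclusion Bayes factor.
library(bridgesampling)

ml_base <- bridge_sampler(fit_base)   # log marginal likelihood, baseline model
ml_viq  <- bridge_sampler(fit_viq)    # log marginal likelihood, baseline + VIQ

# Posterior model probabilities under equal prior model odds
pp <- post_prob(ml_base, ml_viq, model_names = c("base", "viq"))

# Inclusion BF for VIQ in this two-model set:
# posterior odds of models containing VIQ divided by the prior odds of inclusion
(pp["viq"] / pp["base"]) / (0.5 / 0.5)
```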
Once the final model for each outcome was selected, we additionally tested all regression slopes in a Bayesian framework, using the 95% CrI to determine whether each slope was likely to be nonzero in magnitude. If the full 95% CrI excluded zero, we rejected the point null hypothesis that the effect was exactly zero. However, as this point null hypothesis is always false at the population level75, we also tested these effects for practical significance76. Within the Bayesian hypothesis-testing framework, this was done by defining a region of practical equivalence77 (ROPE), an interval of parameter values considered small enough to be equivalent to zero in practice (with separate ROPE bounds specified for the linear and logistic models). Evidence both for and against the true parameter value falling within the ROPE can be quantified by calculating a ROPE Bayes factor (BFROPE), defined as the odds of the prior parameter distribution falling within the ROPE divided by the odds of the posterior effect-size distribution falling within the ROPE78,79. These Bayes factors can be interpreted on the same scale as previously discussed for BF10 and BFinc67,68. When a parameter was credibly nonzero or a given variable was included in the final model but the corresponding BFROPE value was smaller than 0.333, we considered that variable as not predicting the MET outcome of interest to a practically meaningful extent. Lastly, in order to assess the predictive power of the final model, we calculated the Bayesian R2 coefficient proposed by Gelman et al.80
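As an illustrative sketch of this step (not the exact analysis code), the bayestestR R package can summarize posterior slopes and compute Bayes factors against an interval (ROPE) null for a fitted brms model; final_fit is a hypothetical fitted model, and the rope_range() defaults used below are an assumption rather than necessarily the exact bounds used in the study.

```r
# Hypothetical practical-significance check on the slopes of a fitted final model.
library(bayestestR)

describe_posterior(final_fit, centrality = "median", ci = 0.95)    # posterior medians and 95% CrIs
bayesfactor_parameters(final_fit, null = rope_range(final_fit))    # Bayes factors against a ROPE null
# (requires prior samples; bayestestR will sample them from the model's priors if needed)
```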
All Bayesian [G]LMEMs were fit in Stan using the brms R package81,82 with weakly informative priors, including Normal(0, 1) priors on all (standardized) regression slopes and intercept terms, as well as default half-Student-t(3, 0, 2.5) priors on the standard deviation of each random slope or intercept term. Model parameters were estimated via Markov chain Monte Carlo (MCMC) using the No-U-Turn Sampler (NUTS) implemented in Stan83, with the posterior distribution of each parameter estimated using 21,000 post-warmup MCMC draws from seven Markov chains (14,000 draws in cases where missing data were present). Parameter summaries from these posterior distributions were operationalized as the posterior median and the 95% CrI. Convergence for each model was confirmed by examination of Markov chain trace plots, as well as values of the Gelman–Rubin (Gelman & Rubin, 1992) convergence diagnostic (R̂) < 1.01. Missing data were handled using multiple imputation (five imputed datasets) based on the random forest imputation algorithm implemented in the missForest R package84,85.
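For illustration, the prior and sampler settings described above could be specified in brms as in the following sketch, here applied to the baseline CE formula sketched after Equation 1; the object and data names (ce_formula, met, fit_ce) and ancillary settings such as the seed are assumptions, not the study's actual script.

```r
# Hypothetical sketch of the prior specification and MCMC settings described above.
library(brms)

priors <- c(
  prior(normal(0, 1), class = "b"),            # (standardized) regression slopes
  prior(normal(0, 1), class = "Intercept"),    # intercept
  prior(student_t(3, 0, 2.5), class = "sd")    # brms default half-t on group-level SDs
)

fit_ce <- brm(ce_formula, data = met, prior = priors,
              chains = 7, warmup = 1000, iter = 4000, cores = 7, seed = 1)
              # 7 chains x 3,000 post-warmup draws = 21,000 draws

summary(fit_ce)    # check that R-hat < 1.01 for all parameters
plot(fit_ce)       # trace plots for visual convergence checks
bayes_R2(fit_ce)   # Bayesian R2 (Gelman et al.)
```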