The current investigation was a secondary data analysis of TAS-20 responses collected as part of multiple online survey studies (see “Participants” section for more details on each study). Participants reporting professional diagnoses of autism spectrum disorder were recruited from the Simons Foundation Powering Autism Research for Knowledge (SPARK) cohort, a U.S.-based online community that allows autistic individuals and their families to participate in autism research studies (85). In order to compare TAS scores and item responses between autistic and non-autistic individuals, we combined the SPARK sample with open data from the Human Penguin Project (86,87), a large multinational survey study investigating the relationships between core body temperature, social network structure, and a number of other variables (including alexithymia measured using the TAS) in adults from the general population. The addition of a control group provides a substantial amount of additional information, allowing us to assess item-level differential item functioning (I-DIF) across diagnostic groups, evaluate the psychometric properties of any newly created TAS short forms in the general population, and generate normative scores for these short forms based on the distribution of TAS scores in this sample. Although autism status was not assessed in the control sample, the general population prevalence of approximately 2% autistic adults (88) does not cause enough “diagnostic noise” in an otherwise non-autistic sample to meaningfully bias item parameter estimates or alter tests of differential item functioning (79).
Participants
SPARK (Autism) Sample
Using the SPARK Research Match service, we invited autistic adults between the ages of 18 and 45 years to take part in our study via the SPARK research portal. All individuals self-reported a prior professional diagnosis of autism spectrum disorder or an equivalent condition (e.g., Asperger syndrome, PDD-NOS). Notably, although these diagnoses are not independently validated by SPARK, the majority of participants are recruited from university autism clinics and thus have a very high likelihood of valid autism diagnosis (85). Furthermore, validation of diagnoses in the Interactive Autism Network, a similar participant pool now incorporated into SPARK, found that 98% of registry participants were able to produce valid clinical documentation of self-reported diagnoses when requested (89). Autistic participants in our study completed a series of surveys via the SPARK platform that included the TAS-20, additionally providing demographics, current and lifetime psychiatric diagnoses, and scores on self-report questionnaires measuring autism severity, quality of life, co-occurring psychiatric symptoms, and a number of other clinical variables (see “Measures” section for descriptions of the questionnaires analyzed in the current study). These data were collected during the winter and spring of 2019 as part of a larger study on repetitive thinking in autistic adults (project number RM0030Gotham), and the SPARK participants in the current study are a subset of those described by Williams et al. (79). Participants received a total of $50 in Amazon gift cards for completion of the study. A total of 1,012 individuals enrolled in the study, 743 of whom were included in the current analyses.
Participants were excluded if they (a) did not self-report a professional diagnosis of autism on the demographics form, (b) did not complete the TAS-20, (c) indicated careless responding as determined by incorrect answers to two instructed-response items (e.g., Please respond ‘Strongly Agree’ to this question.), or (d) answered “Yes” or “Suspected” to a question regarding being diagnosed with Alzheimer’s disease (which given the age of participants in our study almost certainly indicated random or careless responding). All participants gave informed consent, and all study procedures were approved by the institutional review board at Vanderbilt University Medical Center.
Human Penguin Project (General Population) Sample
Data from a general population control sample were derived from an open dataset generated by the Human Penguin Project (HPP) (86,87), a multinational survey study designed to test the theory of social thermoregulation (90). Because the full details of this sample have been reported elsewhere (86,87), we provide only a brief overview, focusing primarily on the participants whose data were utilized in the current study. The HPP sample was collected in two separate studies in 2015–2016: one online pilot study (N = 232) that recruited participants from Amazon’s Mechanical Turk and the similar crowdsourcing platform Prolific Academic (91,92) and a larger cross-national study (12 countries, total N = 1523) that recruited subjects from 15 separate university-based research groups. In order to eliminate problems due to the non-equivalence of TAS items in different languages, we used only those data where the TAS-16 was administered in English (i.e., all crowdsourced pilot data, as well as cross-national data from the University of Oxford, Virginia Commonwealth University, University of Southampton, Singapore Management University, and University of California, Santa Barbara). Additionally, in order to match the HPP and SPARK samples on mean age, we excluded all HPP participants over the age of 60. Notably, individuals aged 45–60 were included due to the relative excess of individuals aged 20–30 in the HPP sample, which caused the subsample of 18–45-year-old HPP participants to be several years younger on average than the SPARK sample. The final HPP sample thus consisted of a total of 721 English-speaking adults aged 18–60 (MTurk n = 122; Prolific n = 84; Oxford n = 129; Virginia n = 148; Southampton n = 6; Singapore n = 132; Santa Barbara n = 100). As a part of this study, all participants completed a 16-item version of the TAS (TAS-16) that excludes four TAS-20 items (16, 17, 18, and 20) on the basis of poor factor loadings in the psychometric study of Kooiman et al. (64). In addition to item-level data from the TAS-16, we extracted the following variables: age (calculated from birth year), sex, and site of recruitment. The HPP was approved under an “umbrella” ethics proposal at Vrije Universiteit Amsterdam, and separately at each contributing site. All study procedures complied with the ethics code outlined in the Declaration of Helsinki.
Measures
Toronto Alexithymia Scale (TAS)
The TAS (2,33) is the most frequently and widely used self-report measure of alexithymia, as well as the most commonly administered alexithymia measure in the autism literature (3). The most popular version of the instrument, the TAS-20, has been used in medical, psychiatric, and general-population samples as a composite measure of alexithymia for over 25 years (2), and this form has been translated into over 30 languages/dialects. The TAS-20 contains twenty items rated on a five-point Likert scale ranging from Strongly Disagree to Strongly Agree. The TAS-20 is organized into three subscales, Difficulty Identifying Feelings (DIF; 7 items), Difficulty Describing Feelings (DDF; 5 items), and Externally-oriented Thinking (EOT; 8 items), corresponding to three of the four components of the alexithymia construct defined by Nemiah, Freyberger, and Sifneos (1). Notably, the fourth component, Difficulty Fantasizing (DFAN), was also included in the original 26-item version of the TAS (34), but this subscale showed poor coherence with the other three and was ultimately dropped from the measure (2). The sum of items on the TAS-20 is often used as an overall measure of alexithymia, and scores of 61 or higher are typically used to create binary alexithymia classifications in both general population and clinical samples.
As noted earlier, neurotypical participants in the HPP sample filled out the TAS-16, a version of the TAS-20 in which four problematic items have been removed from the scale (64). However, as we wished to compare total scores from the TAS-20 between HPP and SPARK samples, we conducted single imputation for missing items in both groups using a random-forest algorithm implemented in the R missForest package (93–95). Such item-level imputation allowed us to approximate the TAS-20 score distribution of the HPP participants, including the proportion of individuals exceeding the “high alexithymia” cutoff of 61. Notably, although the “high alexithymia” cutoff is theoretically questionable given the taxometric evidence for alexithymia as a purely dimensional construct (2), we chose to calculate this measure to facilitate comparisons with prior literature that primarily reported the proportion of autistic adults exceeding this cutoff (3). To further validate the group comparisons derived from these imputed data, we additionally calculated prorated TAS-16 total scores by taking the mean of all 16 items administered to all participants, which was subsequently multiplied by 20 for comparability with the TAS-20 total score. These scores were then compared between groups, and the proportion of individuals in each group with prorated scores ≥61 was also compared to the proportions derived from (imputed) TAS-20 scores.
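The proration step reduces to a simple rescaling of the mean item response onto the TAS-20 total-score metric. A minimal sketch (illustrative Python, not the R code used in the study; the function name and signature are our own):

```python
def prorated_tas20(item_responses, n_full=20, cutoff=61):
    """Prorate a partial TAS administration onto the TAS-20 total-score metric.

    item_responses: list of 1-5 Likert responses (e.g., the 16 TAS-16 items,
    already reverse-scored where applicable). Returns the prorated total and
    whether it meets the "high alexithymia" cutoff of 61.
    """
    mean_item = sum(item_responses) / len(item_responses)
    total = mean_item * n_full
    return total, total >= cutoff

# A respondent answering "3" to all 16 items prorates to 60, just below the cutoff
score, high = prorated_tas20([3] * 16)
```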
Clinical Measures for Validity Testing
In addition to the TAS-20, individuals in the SPARK sample completed a number of other self-report questionnaires, including measures of autism symptomatology, co-occurring psychopathology, trait neuroticism, and autism-related quality of life. Measures of autistic traits included the Social Responsiveness Scale–Second Edition (SRS-2) total T-score (96) and a self-report version of the Repetitive Behavior Scale–Revised (RBS-R) (97,98), from which we derived measures of “lower-order” and “higher-order” repetitive behaviors (i.e., the Sensory Motor [SM] and Ritualistic/Sameness [RS] subscales reported by McDermott et al. (97)). Depression was measured using autism-specific scores on the Beck Depression Inventory–II (BDI-II) (79,99), and we additionally used BDI-II item 9 (Suicidal Thoughts or Wishes) to quantify current suicidality. We additionally assessed generalized and social anxiety using the Generalized Anxiety Disorder–7 (GAD-7) (100) and Brief Fear of Negative Evaluation Scale–Short Form (BFNE-S) (101,102), respectively. Somatization was quantified using a modified version of the Patient Health Questionnaire–15 (PHQ-15) (103,104), which extended the symptom recall period to three months and excluded the two symptoms of dyspareunia and menstrual problems. We measured trait neuroticism using ten items from the international personality item pool (105), originally from the Multidimensional Personality Questionnaire’s “Stress Reaction” subscale (106) and referred to here as the IPIP-N10. Lastly, autism-related quality of life was measured using the Autism Spectrum Quality of Life (ASQoL) questionnaire (107). More in-depth descriptions of all measures analyzed in the current study, including reliability estimates in the SPARK sample, can be found in the Supplemental Methods.
Statistical Analyses
Confirmatory Factor Analysis and Model-based Bifactor Coefficients
All statistical analyses were performed in the R statistical computing environment (108).
In order to test the appropriateness of the proposed TAS-20 factor structure in autistic adults, we performed a confirmatory factor analysis (CFA) on TAS-20 item responses in our SPARK sample. The measurement model in our CFA included a bifactor structure with one “general alexithymia” factor onto which all items loaded, as well as four “specific” factors representing the three subscales of the TAS-20 and a common method factor for the reverse-coded items (70). Given the previously identified problems with the EOT subscale and the reverse-coded items (2), we additionally examined a bifactor model fit only to the forward-coded DIF and DDF items, removing both the EOT and reverse-coded items. Although not the focus of the current investigation, we also fit the original and reduced TAS factor models in the HPP sample in order to determine whether any identified model misfit was present only in autistic adults or more generally across both samples. We fit the model using a diagonally weighted least squares estimator (109) with a mean- and variance-corrected test statistic (i.e., “WLSMV” estimation), as implemented in the R package lavaan (110). Very few of the item responses in our dataset contained missing values (0.16% missing item responses in the SPARK sample, no missing TAS-16 data in the HPP sample), and missing values were singly imputed using missForest (93–95).
Model fit was evaluated using the chi-square test of exact fit, comparative fit index (CFI; 111), Tucker-Lewis index (TLI; 112), root mean square error of approximation (RMSEA; 113), standardized root mean square residual (SRMR; 114), and weighted root mean square residual (WRMR; 115,116). The categorical maximum likelihood (cML) estimator proposed by Savalei (117) was used to calculate the CFI, TLI, and RMSEA, as these indices better approximate the population values of the maximum likelihood-based fit indices used in linear CFA than analogous measures calculated from the WLSMV test statistic (118). Moreover, the SRMR was calculated using the unbiased estimator (i.e., SRMRu) proposed by Maydeu-Olivares (119, see also 120) and implemented in lavaan for categorical estimators. CFIcML/TLIcML values greater than 0.95, RMSEAcML values less than 0.06, SRMRu values less than 0.08, and WRMR values less than 1.0 were defined as indicating adequate global model fit, based on standard rules of thumb employed in the structural equation modeling literature (114–116). In addition to the aforementioned global fit indices, we checked for localized areas of model misfit based on examination of the residual correlations (121), with absolute residual correlations greater than 0.1 indicating areas of potentially significant misfit and/or violations of local independence (122).
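The adequacy criteria above can be summarized as a simple screening rule applied to the five global indices. A minimal sketch (illustrative Python; the function is our own convenience wrapper, not part of any analysis package):

```python
def adequate_fit(cfi, tli, rmsea, srmr_u, wrmr):
    """Screen global fit indices against the rule-of-thumb cutoffs used here:
    CFI/TLI > .95, RMSEA < .06, SRMRu < .08, WRMR < 1.0."""
    checks = {
        "CFI": cfi > 0.95,
        "TLI": tli > 0.95,
        "RMSEA": rmsea < 0.06,
        "SRMRu": srmr_u < 0.08,
        "WRMR": wrmr < 1.0,
    }
    return all(checks.values()), checks

# A model passing all five criteria would be deemed adequately fitting
ok, checks = adequate_fit(cfi=0.97, tli=0.96, rmsea=0.04, srmr_u=0.05, wrmr=0.80)
```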
Confirmatory bifactor models were further interrogated via the calculation of several model-based coefficients (123–125), including (a) coefficient omega total (ωT), a measure of the reliability of the multidimensional TAS-20 total score; (b) coefficient omega hierarchical (ωH), a measure of general factor saturation (i.e., the proportion of total score variance attributable to the general factor); (c) coefficient omega subscale (ωS), a measure of the reliability of each individual subscale; (d) coefficient omega hierarchical subscale (ωHS), a measure of the proportion of subscale variance attributable to the specific factor; (e) the explained common variance (ECV; the proportion of common variance attributable to the general factor), calculated for the total score and each item separately; and (f) the percentage of uncontaminated correlations (PUC), a supplementary index used in tandem with total ECV to determine whether a scale can be considered “essentially unidimensional” (124,126). Omega coefficients calculated in the current study were based on the categorical data estimator proposed by Green and Yang (127). ECV coefficients were also calculated for individual subscales (S-ECV) as an additional measure of subscale general factor saturation.
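For a linear factor model, ωT, ωH, and ECV have simple closed forms in terms of standardized bifactor loadings. The study itself used the Green–Yang categorical estimator, but the linear version below conveys the structure of the coefficients (illustrative Python with made-up loadings; each item is assumed to load on exactly one specific factor, in the same order as the general-factor loadings):

```python
def bifactor_indices(gen_loadings, spec_loadings):
    """Compute omega-total, omega-hierarchical, and ECV from standardized
    bifactor loadings (linear approximation).

    gen_loadings: general-factor loading per item.
    spec_loadings: dict mapping each specific factor to its items' loadings,
    concatenated in the same item order as gen_loadings.
    """
    var_gen = sum(gen_loadings) ** 2                      # (sum of general loadings)^2
    var_spec = sum(sum(ls) ** 2 for ls in spec_loadings.values())
    all_spec = [l for ls in spec_loadings.values() for l in ls]
    # Unique (error) variance per item under standardized loadings
    uniq = sum(1 - g ** 2 - s ** 2 for g, s in zip(gen_loadings, all_spec))
    total_var = var_gen + var_spec + uniq
    omega_t = (var_gen + var_spec) / total_var            # reliability of total score
    omega_h = var_gen / total_var                         # general factor saturation
    # ECV: general factor's share of the common variance
    common_gen = sum(g ** 2 for g in gen_loadings)
    ecv = common_gen / (common_gen + sum(s ** 2 for s in all_spec))
    return omega_t, omega_h, ecv

# Hypothetical 6-item scale with two specific factors
ot, oh, ecv = bifactor_indices([0.6] * 6, {"A": [0.4] * 3, "B": [0.4] * 3})
```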
Item Response Theory and Differential Item Functioning Analyses
After selecting an appropriate factor model, we evaluated the ECV and PUC coefficients to determine whether the model could be reasonably well-approximated by a unidimensional item response theory (IRT) model. We then fit a graded response model (128) to the responses to the TAS items included in the best-fitting factor model in our SPARK sample, using maximum marginal likelihood estimation (129) as implemented in the mirt R package (130). Model fit was assessed using the limited-information C2 statistic (131,132), as well as C2-based approximate fit indices and the SRMR. Based on previously published guidelines (133), we defined values of CFIC2 > 0.975, RMSEAC2 < 0.089, and SRMR < 0.05 as indicative of good model fit. Residual correlations were examined to determine areas of local dependence, with absolute values greater than 0.1 indicative of potential misfit. Items with multiple large residual correlations were flagged for removal, and the IRT model was then re-fit and iteratively tested until all areas of local misfit were removed.
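Under the graded response model, each item's category probabilities are differences of adjacent cumulative logistic curves. A minimal sketch with hypothetical item parameters (illustrative Python; the study used mirt in R):

```python
import math

def grm_probs(theta, a, thresholds):
    """Category probabilities for one graded-response item.

    a: discrimination; thresholds: ordered difficulties b_1 < ... < b_{K-1}.
    P(X >= k | theta) = logistic(a * (theta - b_k)); each category probability
    is the difference between adjacent cumulative probabilities.
    """
    cum = [1.0] + [1 / (1 + math.exp(-a * (theta - b))) for b in thresholds] + [0.0]
    return [cum[k] - cum[k + 1] for k in range(len(cum) - 1)]

# Hypothetical five-category Likert item (a = 1.8, symmetric thresholds)
probs = grm_probs(theta=0.0, a=1.8, thresholds=[-1.5, -0.5, 0.5, 1.5])
```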
After refining the unidimensional TAS model in the SPARK sample, we further investigated the same model in the HPP sample. Once a structural model was found to fit in both samples, we fit a multi-group graded response model to the full dataset, using this model to examine I-DIF between groups. I-DIF was tested using the iterative Wald test procedure proposed by Cao et al. (134) and implemented in R by the first author (135), with Benjamini-Hochberg–corrected (136) p values < 0.05 used to flag items for I-DIF. Significant omnibus Wald tests were followed up with tests of individual item parameters to determine which parameters significantly differed between groups (137). Notably, this I-DIF procedure is quite powerful in large sample sizes, potentially revealing trivial group differences, and thus I-DIF effect-size indices were used to determine whether the differential functioning of a given item was small enough to be ignorable in practice. In particular, we used the weighted area between curves (wABC) as a measure of I-DIF magnitude, with values greater than 0.30 indicative of practically significant I-DIF (84). We additionally reported the expected score standardized difference (ESSD), a standardized effect size interpretable on the metric of Cohen’s d (83). Items exhibiting practically significant I-DIF between autistic and non-autistic adults were further flagged for removal, and this process was repeated iteratively until the resulting TAS short form contained no items with practically significant I-DIF by diagnostic group. The total effect of all I-DIF (i.e., differential test functioning [DTF]) was then estimated using the unsigned expected test score difference in the sample (UETSDS), the expected absolute difference in manifest test scores between individuals of different groups possessing the same underlying trait level (84).
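The wABC statistic integrates the absolute gap between the two groups' expected item score curves, weighted by the focal group's trait density. A numerical sketch under assumed item parameters (illustrative Python; parameter values are hypothetical, and the published effect-size software should be used in practice):

```python
import math

def logistic(x):
    return 1 / (1 + math.exp(-x))

def expected_score(theta, a, thresholds):
    """Expected score (coded 0..K) for a graded-response item."""
    return sum(logistic(a * (theta - b)) for b in thresholds)

def wabc(ref_params, foc_params, mu_foc=0.0, sd_foc=1.0, lo=-6.0, hi=6.0, n=2001):
    """Weighted area between the reference and focal expected score curves,
    using the focal group's normal trait density as the weight."""
    step = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        theta = lo + i * step
        gap = abs(expected_score(theta, *ref_params) - expected_score(theta, *foc_params))
        dens = math.exp(-0.5 * ((theta - mu_foc) / sd_foc) ** 2) / (sd_foc * math.sqrt(2 * math.pi))
        total += gap * dens * step
    return total

same = (1.5, [-1.0, 0.0, 1.0])      # identical parameters in both groups: no DIF
shifted = (1.5, [0.0, 1.0, 2.0])    # hypothetical uniform threshold shift of 1.0
```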
After removing items based on between-group I-DIF, we then examined I-DIF of the resulting short form across subsets of the autistic population. Using the same iterative Wald procedure and effect size criteria as the between-group analyses, we tested whether TAS items functioned differently across groups based on sex, gender, age (>30 vs. ≤30 years), race (non-Hispanic White vs. Other), level of education (any higher education vs. no higher education), age of autism diagnosis (≥18 years old vs. <18 years), and self-reported co-occurring conditions (current depressive disorder, current anxiety disorder, and lifetime attention deficit hyperactivity disorder [ADHD]). Although far fewer stratification variables were collected in the HPP sample, I-DIF was also examined within that sample according to age (>30 vs. ≤30 years), sex, and phase of the project (i.e., pilot study vs. multi-site study). These I-DIF results were used to further refine the measure such that the resulting TAS short form exhibited I-DIF across all groups that was small enough to be practically ignorable. All items retained in the TAS form at this stage were incorporated into the final measure.
Once the TAS short form was finalized, we then fit an additional multi-group graded response model to only those final items, constraining item parameters to be equal between groups and setting the scale of the latent variable by constraining the general population sample to have a mean of 0 and a standard deviation of 1. Using this model, we then estimated maximum a posteriori (MAP) TAS latent trait scores for each individual, which were interpretable as Z-scores relative to the general population (i.e., a score of 1 is one full standard deviation above the mean of our non-autistic normative sample). Individual reliability coefficients were also examined, with values greater than 0.7 deemed sufficiently reliable for interpretation at the individual level.
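MAP scoring combines the graded-response likelihood with the standard-normal prior that anchors scores to the normative sample. A grid-search sketch with hypothetical item parameters (illustrative Python; mirt uses numerical optimization, but the idea is identical):

```python
import math

def logistic(x):
    return 1 / (1 + math.exp(-x))

def grm_logprob(response, theta, a, thresholds):
    """Log-probability of one graded response (coded 0..K-1)."""
    cum = [1.0] + [logistic(a * (theta - b)) for b in thresholds] + [0.0]
    return math.log(cum[response] - cum[response + 1])

def map_score(responses, items, lo=-4.0, hi=4.0, n=1601):
    """Maximum a posteriori trait estimate under a N(0, 1) prior,
    interpretable as a Z-score relative to the normative sample."""
    best_theta, best_lp = lo, -math.inf
    step = (hi - lo) / (n - 1)
    for i in range(n):
        theta = lo + i * step
        lp = -0.5 * theta ** 2  # log N(0, 1) prior, up to a constant
        lp += sum(grm_logprob(x, theta, a, b) for x, (a, b) in zip(responses, items))
        if lp > best_lp:
            best_theta, best_lp = theta, lp
    return best_theta

# Hypothetical 3-item short form, five response categories per item
items = [(1.8, [-1.5, -0.5, 0.5, 1.5])] * 3
```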
Validity Testing
To further test the validity of the newly generated TAS latent trait scores in autistic adults, we investigated the relationships between these scores and a number of clinical variables that have previously demonstrated relationships with alexithymia in either autistic adults or the general population. Based on previous literature (59), we hypothesized that alexithymia would show moderate to strong positive correlations with neuroticism (IPIP-N10), autistic traits (SRS-2), repetitive behavior (RBS-R), depression (BDI-II), generalized anxiety (GAD-7), social anxiety (BFNE-S), suicidality (BDI item 9), and somatic symptom burden (PHQ-15), as well as moderate negative correlations with autism-specific QoL (ASQoL). Given the documented relationships between neuroticism and alexithymia, we further examined the magnitude of these correlations after controlling for levels of neuroticism. We additionally examined relationships between alexithymia scores and demographic variables, including age, sex, race/ethnicity, age of autism diagnosis, and level of education. Notably, alexithymia is correlated with older age, male sex, and lower education level in the general population (138–140), and we expected that these relationships would replicate in the current SPARK sample (with the exception of the correlation with age, given the restricted age range in our current sample). We did not, however, expect to find significant associations between alexithymia and race/ethnicity or age of autism diagnosis.
Relationships between alexithymia and external variables were examined using robust Bayesian variants of the Pearson correlation coefficient (for continuous variables, e.g., SRS-2 scores), polyserial correlation coefficient (for ordinal variables, such as the BDI-II suicidality item and education level), partial correlation coefficient (when testing relationships after controlling for neuroticism), and unequal-variances t-test (141–143), as implemented using custom R code (144) and the brms package (145). Additional technical details regarding model estimation procedures and prior distributions can be found in the Supplemental Methods. Standardized effect sizes produced by these methods (i.e., r, rp, and d) were summarized using the posterior median and 95% highest-density credible interval (CrI).
In addition to estimating the magnitude of each effect size, we tested these effects for “practical significance” (146) within a Bayesian hypothesis testing framework. To do this, we defined interval null hypotheses within which all effect sizes were deemed too small to be practically meaningful. This interval, termed the region of practical equivalence (ROPE) (147), was defined in the current study as the interval d = [-0.2, 0.2] for t-tests, r = [-0.2, 0.2] for bivariate correlations, and rp = [-0.1, 0.1] for partial correlations. Evidence both for and against this interval null hypothesis can be quantified by calculating the ROPE Bayes factor (BFROPE), which is defined as the odds of the prior effect size distribution falling within the ROPE divided by the odds of the posterior effect size distribution falling within the ROPE (148,149). In accordance with standard interpretation of Bayes factor values (150,151), we defined BFROPE values greater than 3 as providing substantial evidence for the alternative hypothesis (i.e., the true population effect lies outside the ROPE) and BFROPE values less than 0.333 as providing substantial evidence for the null hypothesis (i.e., the true population effect lies within the ROPE and thus is not practically meaningful).
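Given posterior and prior draws of an effect size, BFROPE is simply the prior odds of landing inside the ROPE divided by the posterior odds. A Monte-Carlo sketch (illustrative Python with simulated draws; the study computed these quantities from brms posteriors in R):

```python
import random

def bf_rope(prior_draws, posterior_draws, rope=(-0.2, 0.2), eps=1e-6):
    """ROPE Bayes factor: prior odds of the effect lying within the ROPE
    divided by the posterior odds. BF > 3: substantial evidence the effect
    lies outside the ROPE; BF < 1/3: substantial evidence it lies within."""
    def odds(draws):
        p = sum(rope[0] <= d <= rope[1] for d in draws) / len(draws)
        p = min(max(p, eps), 1 - eps)  # guard against proportions of exactly 0 or 1
        return p / (1 - p)
    return odds(prior_draws) / odds(posterior_draws)

random.seed(1)
prior = [random.gauss(0.0, 1.0) for _ in range(20000)]        # wide prior on d
post_large = [random.gauss(0.8, 0.05) for _ in range(20000)]  # effect far from zero
post_null = [random.gauss(0.0, 0.03) for _ in range(20000)]   # effect near zero
```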
Values of BFROPE between 0.333 and 3 are typically considered inconclusive, providing only “anecdotal” evidence for either hypothesis (150).
Readability Analysis
As a supplemental analysis, we evaluated the readability of the TAS-20 and the newly-derived short form using the FORCAST formula (152). This formula is well-suited to questionnaire material, as it ignores the number of sentences, average sentence length, and hard punctuation (standard metrics for text in prose form), instead focusing exclusively on the number of monosyllabic words (153). FORCAST grade level equivalent was calculated for both the TAS-20 (excluding the questionnaire directions) and the short form derived in the current study.
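The FORCAST grade level depends only on the share of single-syllable words: GL = 20 − M/10, where M is the number of monosyllabic words per 150-word sample. A sketch using a naive vowel-group syllable counter (illustrative Python; dedicated readability software such as that used here relies on dictionary-based syllabification):

```python
import re

def syllables(word):
    """Naive syllable count: contiguous vowel groups (crude approximation;
    real readability tools use pronunciation dictionaries)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def forcast_grade(text):
    """FORCAST grade level: GL = 20 - M/10, where M is the count of
    monosyllabic words per 150-word sample."""
    words = re.findall(r"[A-Za-z']+", text)
    mono = sum(1 for w in words if syllables(w) == 1)
    m = mono / len(words) * 150  # scale to a 150-word sample
    return 20 - m / 10
```

Text composed entirely of monosyllables scores a grade level of 5.0 (the formula's floor), while text with no monosyllables scores 20.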
Additionally, in order to compare our results with prior work on the readability of the TAS-20, we calculated the Flesch-Kincaid Grade Level (FKGL) and Flesch Reading Ease (FRE) scores (154,155) for both the TAS-20 and short form. All readability analyses were conducted using Readability Studio version 2019.3 (Oleander Software, Ltd, Vandalia, OH, USA). Although we did not attempt to select items based on readability, this analysis was conducted to ensure that shortening of the TAS questionnaire did not substantially increase the reading level, thereby making the short form measure less accessible to younger or less educated respondents.
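The Flesch formulas combine sentence length with syllables per word: FKGL = 0.39·(words/sentences) + 11.8·(syllables/words) − 15.59, and FRE = 206.835 − 1.015·(words/sentences) − 84.6·(syllables/words). A sketch reusing a naive vowel-group syllable counter (illustrative Python; commercial readability software uses dictionary-based syllabification):

```python
import re

def naive_syllables(word):
    # Crude vowel-group count; real tools use pronunciation dictionaries
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_scores(text):
    """Return (FKGL, FRE) using the standard Flesch formulas."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syl = sum(naive_syllables(w) for w in words)
    wps = len(words) / sentences   # words per sentence
    spw = syl / len(words)         # syllables per word
    fkgl = 0.39 * wps + 11.8 * spw - 15.59
    fre = 206.835 - 1.015 * wps - 84.6 * spw
    return fkgl, fre

# Very short, monosyllabic text yields a negative grade level and FRE above 100
fkgl, fre = flesch_scores("The cat sat.")
```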