Through RSA, we showed that the emotions represented in three different modalities (visual, linguistic, and visio-linguistic) share commonalities in relational structure while exhibiting modality-specific variations (Experiment 1). We also observed that in individual brain regions, emotion representations derived from BOLD signal changes associated with emotional experiences showed different degrees of similarity to the emotion representations from the visual, linguistic, and visio-linguistic modalities (Experiment 2). Furthermore, we explored the extent to which emotion representations based on facial expressions correspond to those of other modalities through prediction analysis using a rigid transformation technique and through evaluation of transfer performance using an ANN trained on facial expression discrimination (Experiment 3). The results of the present study reveal three key points: 1) The representational relationships between emotions calculated from the same modality but using different methods are similar, although this similarity diminishes to some extent across different modalities. 2) The representational relationships of visual emotion and linguistic emotion show relatively strong correlations with neural responses in posterior and anterior brain regions, respectively, whereas the representational relationships of visio-linguistic emotion are the most similar to neural responses across the whole brain. 3) The representational relationships between emotions across different modalities have similar structures, to the extent that they can be linearly mapped onto each other.
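To make the comparison procedure concrete, the following is a minimal sketch of the RSA step, assuming that the emotion vectors of each modality are stored as category-by-feature matrices; the variable names, the number of categories, and the use of Pearson-correlation RSMs compared by Spearman correlation are illustrative assumptions rather than the exact pipeline of this study.

```python
import numpy as np
from scipy.stats import spearmanr

def emotion_rsm(vectors):
    """Representational similarity matrix: pairwise Pearson correlations
    between emotion-category vectors (rows = categories, cols = features)."""
    return np.corrcoef(vectors)

def compare_rsms(rsm_a, rsm_b):
    """Spearman correlation between the upper-triangle (off-diagonal)
    elements of two RSMs defined over the same emotion categories."""
    iu = np.triu_indices_from(rsm_a, k=1)
    return spearmanr(rsm_a[iu], rsm_b[iu])

# Illustrative use with synthetic vectors for 27 emotion categories.
rng = np.random.default_rng(0)
visual_vecs = rng.normal(size=(27, 128))      # e.g., image-model features
linguistic_vecs = rng.normal(size=(27, 300))  # e.g., word-embedding features
rho, p = compare_rsms(emotion_rsm(visual_vecs), emotion_rsm(linguistic_vecs))
print(f"RSM similarity (Spearman rho) = {rho:.3f}, p = {p:.3g}")
```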
Although different modalities exhibited distinctive features in emotion representation, these representations showed a degree of similarity that enabled linear mapping between them. Our results indicate that the topology of emotion representation is largely preserved across different modalities, suggesting a certain consistency between emotions expressed through facial expressions and those conveyed through natural language. These findings are consistent with our everyday experience, in which there is little discrepancy between emotions conveyed through different modalities.
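As a minimal sketch of such a linear mapping, the snippet below fits an orthogonal Procrustes rotation between two modality-specific emotion spaces that share the same categories and a common dimensionality; the matrices and dimensionalities are synthetic placeholders, and the rigid-transformation procedure of Experiment 3 may differ in detail.

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

# Placeholder emotion embeddings: rows = shared emotion categories,
# columns = features, both reduced to a common dimensionality (e.g., by PCA).
rng = np.random.default_rng(1)
source = rng.normal(size=(27, 50))   # e.g., visual emotion vectors
target = rng.normal(size=(27, 50))   # e.g., linguistic emotion vectors

# Center each space, then find the rotation R minimizing ||source @ R - target||_F.
src_c = source - source.mean(axis=0)
tgt_c = target - target.mean(axis=0)
R, scale = orthogonal_procrustes(src_c, tgt_c)
mapped = src_c @ R

# Goodness of the rigid mapping: correlation between mapped and target vectors.
fit = np.corrcoef(mapped.ravel(), tgt_c.ravel())[0, 1]
print(f"correlation after rigid mapping: {fit:.3f}")
```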
Previous research has reported that various brain regions process distinct emotion categories, and that the processing of a particular emotion category involves a distributed network of brain regions rather than a localized one-to-one correspondence between specific brain areas and emotion categories [18, 27, 28]. We used RSA to test whether different brain regions represent emotional experiences in distinct ways corresponding to different modalities, by comparing RSMs based on brain activity with those calculated from the various datasets. The most important finding of this study is that individual brain areas not only respond differently to different emotion categories but also do so with varying degrees of similarity to the representations of the different modalities.
The overall similarity between the emotion representations in brain regions and those of the visio-linguistic modality, which combines visual and linguistic information in a multi-modal representation, was higher than the similarity to the single-modality representations of either visual or linguistic information. Numerous previous studies have investigated brain regions involved in multi-modal emotion representations across various modalities [29–31], and these studies have consistently highlighted the contributions of areas such as the TPJ, PC, STS, medial prefrontal cortex (MPFC), and OFC to modality-independent emotion representations. In our study, several regions, including the TPJ, PC, STS, and MPFC, exhibited a high correlation between the brain emotion RSM and the visio-linguistic emotion RSM. This finding, which is in line with the previous literature, supports the notion that these areas are involved in modality-independent processing of emotional expressions.
In contrast, our results showed that the OFC, which has previously been reported to be involved in multi-modal emotion representations [29], did not exhibit a high correlation with the emotion RSM of any modality. This discrepancy could be attributed to differences in analytical approach. Chikazoe et al. [29] analyzed OFC activity on the basis of emotions described using affective dimensions such as positive or negative valence, whereas the data we used from Horikawa et al. [18] were analyzed on the basis of emotion categories. Psychophysical studies [3, 13] and a neuroscientific study [18] have shown that human emotions are better explained by emotion categories than by affective dimensions. However, OFC activity patterns are reported to correlate less with evaluations based on emotion categories, suggesting that the OFC may process emotion attributes on the basis of coarse affective reactions, such as pleasant and unpleasant, rather than fine-grained emotion categories (see Figure S1A in Horikawa et al. [18]).
In our study, several regions associated with visual information processing, such as the posterior areas and the IPL, tended to show high representational similarity with the visual modality, whereas brain regions other than the posterior areas tended to show high representational similarity with the linguistic modality. Previous research reported that the IPL processes emotional expressions specific to facial expressions [32]. Therefore, our observation that the IPL correlated most strongly with the visual modality (compared with the other modalities) may be contingent on the emotional impressions conveyed by the facial images in the video stimuli. To explore this possibility, we conducted an additional analysis similar to Experiment 2, dividing the video stimuli into two sets according to whether the video included a human face (with-face and without-face conditions). However, there was no significant difference between the with-face and without-face conditions (Figure S4). This supplementary result suggests that although the emotion representation in IPL activity is highly correlated with the representation of emotion through facial expressions, it may depend on factors beyond reading emotions from facial expressions themselves. It is also important to note that the absolute correlation coefficients between the IPL's brain activity RSM and the emotion RSMs of all modalities are relatively low. Further refined experiments are necessary to determine conclusively whether the processing of emotions in the IPL is solely attributable to facial expressions or whether it involves additional factors.
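Schematically, this supplementary split analysis can be sketched as below, reusing the emotion_rsm and compare_rsms helpers from the earlier sketch; all data, the number of stimuli and categories, and the averaging of ROI responses within categories are illustrative placeholders rather than the actual Experiment 2 procedure.

```python
import numpy as np

# Synthetic placeholders: IPL responses for 540 video stimuli, an emotion
# category label and a face/no-face flag per stimulus, and a visual emotion RSM.
rng = np.random.default_rng(2)
n_categories = 27
labels = np.tile(np.arange(n_categories), 20)               # 540 stimuli
roi_responses = rng.normal(size=(labels.size, 500))         # stimulus x voxel
has_face = rng.random(labels.size) < 0.5                    # face present?
visual_rsm = emotion_rsm(rng.normal(size=(n_categories, 128)))

def category_means(responses, labels, n_categories):
    """Average stimulus-wise responses within each emotion category."""
    return np.stack([responses[labels == c].mean(axis=0)
                     for c in range(n_categories)])

for name, mask in [("with face", has_face), ("without face", ~has_face)]:
    brain_rsm = emotion_rsm(category_means(roi_responses[mask], labels[mask],
                                           n_categories))
    rho, _ = compare_rsms(brain_rsm, visual_rsm)
    print(f"IPL vs visual emotion RSM ({name}): rho = {rho:.3f}")
```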
This study accords with psychological and neuroscientific reports asserting that human emotions are represented as multi-dimensional vectors organized into distinct categorical dimensions [3, 4, 13, 14, 16–18, 31]. Our study is valuable in that it demonstrates, through a machine learning approach, that emotion representations are linearly transferable across different modalities. This indicates a certain degree of homogeneity in the relationships between emotion representations across different modalities, a property essential for establishing semantic correspondences between emotions even when they are expressed through different modalities, and crucial for facilitating smooth communication in daily life.
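To illustrate what linear transferability means operationally, the sketch below evaluates a leave-one-category-out mapping: a rigid map is fitted on all but one emotion category and the held-out category is identified by nearest neighbour in the target space. The data are synthetic and the procedure is a simplified stand-in for the prediction and ANN-based transfer analyses of Experiment 3.

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

def loo_identification(source, target):
    """Leave-one-category-out: fit a rigid map on the remaining categories,
    map the held-out source vector, and check whether its nearest neighbour
    in the target space is the correct category."""
    n = source.shape[0]
    hits = 0
    for i in range(n):
        train = np.delete(np.arange(n), i)
        R, _ = orthogonal_procrustes(source[train], target[train])
        mapped = source[i] @ R
        dists = np.linalg.norm(target - mapped, axis=1)
        hits += int(np.argmin(dists) == i)
    return hits / n

# Placeholder emotion embeddings sharing 27 categories (chance level = 1/27).
rng = np.random.default_rng(3)
shared = rng.normal(size=(27, 50))
visual = shared + 0.3 * rng.normal(size=(27, 50))      # noisy visual space
linguistic = shared + 0.3 * rng.normal(size=(27, 50))  # noisy linguistic space
print(f"identification accuracy: {loo_identification(visual, linguistic):.2f}")
```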
The high correlation between the visio-linguistic emotion RSM and the brain emotion RSM can be attributed to at least two factors other than the use of a multi-modal representation. First, transformer models with attention mechanisms, which form the backbone of the pretrained CLIP model used to calculate the visio-linguistic representation, may be well suited to extracting semantic representations from training data in general. Second, since the CLIP model was trained on 400 million image-text pairs, the use of large-scale training data might be critical for capturing a diverse range of precise emotion expressions. To examine the first factor, we performed an analysis similar to that in Experiment 2, using a transformer model trained only on text data (GoEmotions [33]) to extract linguistic emotion vectors (see Figure S5 in the Supplemental information). Figure S5 indicates that the correlation map calculated using GoEmotions is similar to those of ConceptNet and Word2Vec, and that the overall average correlation coefficients across diverse brain regions are comparable with those of the other linguistic modality conditions. This suggests that the high correlation between the visio-linguistic emotion RSM and the brain emotion RSM cannot be attributed solely to the use of a transformer model for calculating the emotion RSM. Instead, what matters is conducting multi-modal learning with large-scale data to acquire an emotion representation that is highly similar to that in the human brain. Our findings suggest that emotional experiences are represented differently in each brain region, with varying degrees of similarity across different modalities, and that they may be conveyable cross-modally through different modalities.
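For reference, the sketch below shows one way visio-linguistic emotion scores of this kind can be extracted with a pretrained CLIP model through the Hugging Face transformers library; the checkpoint name, the prompt wording, and the scoring of a frame against category prompts are illustrative assumptions and not necessarily the procedure used in this study.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint; the CLIP variant used in this study may differ.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

emotion_categories = ["admiration", "amusement", "anger"]  # illustrative subset
prompts = [f"a photo of a person feeling {c}" for c in emotion_categories]
frame = Image.new("RGB", (224, 224), (128, 128, 128))      # stand-in video frame

with torch.no_grad():
    text_inputs = processor(text=prompts, return_tensors="pt", padding=True)
    text_feats = model.get_text_features(**text_inputs)        # (3, 512)
    image_inputs = processor(images=frame, return_tensors="pt")
    image_feats = model.get_image_features(**image_inputs)     # (1, 512)

# One possible visio-linguistic emotion vector for the frame: cosine similarity
# between the frame embedding and each emotion-category prompt embedding.
text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
image_feats = image_feats / image_feats.norm(dim=-1, keepdim=True)
emotion_vector = (image_feats @ text_feats.T).squeeze(0)
print(dict(zip(emotion_categories, emotion_vector.tolist())))
```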